How LLMs Actually Work: The Hidden State Is the Story
A visual mental model of Transformers: token IDs become hidden states, attention and MLPs rewrite them, and logits turn them back into text.
Most explanations of Transformers begin with the same sequence:
words -> tokens -> embeddings -> Q/K/V -> attention -> MLP -> logits -> next word
This is not wrong, but it hides the most important object in the model.
If you stare at the parts one by one, it is easy to imagine that tokens turn into queries, keys, and values, which then somehow turn into the next token. That picture is close enough to sound plausible, but wrong enough to create confusion. Q, K, and V are not new entities that replace the token. Logits are not probabilities. The generated token does not arrive with its own Q, K, and V already attached.
The cleaner question is:
What is the thing that keeps moving through a Transformer?
The answer is the hidden state.
More precisely, in modern Transformer language models, the main object is the residual stream: a sequence of vectors that gets repeatedly read, updated, and written back as the model moves through its layers.
The shortest useful mental model is this:
token IDs are the interface
hidden states are the computation
logits are the translation back into token space
An LLM is not really a machine that moves tokens through layers. It is a machine that repeatedly rewrites hidden states until the final state points toward the next token.
Tokens Are Addresses, Not Thoughts
A token is often described as a piece of text, but inside the model the input is not the text itself. The input is a token ID: an integer from the model’s vocabulary.
For example, a tokenizer might map a piece of text like this:
"Paris" -> 40313
The exact number does not matter. What matters is that the number itself has no semantic structure the model can use. The model cannot think with integer IDs. It first has to turn that ID into a vector.
That is what the embedding table does.
token ID -> embedding vector
The token ID is like an address. The embedding vector is the first object the model can compute with.
So the early flow is:
text -> token IDs -> embeddings
But even “embedding” is not the full meaning of the token. It is only the starting point. The embedding is the initial hidden state:
h^0
At this point, the token has a vector representation, but it has not yet been deeply interpreted in context. The word “is” in isolation is not the same thing as the “is” in:
The capital of France is
The job of the Transformer layers is to take that initial vector and repeatedly rewrite it until it becomes a context-aware state.
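The lookup step can be sketched in a few lines of Python. This is a toy illustration, not how any particular model is implemented: the vocabulary, the whitespace "tokenizer", and the numbers in `embedding_table` are all made up, and positional information is left out.

```python
# Toy vocabulary: maps text pieces to integer IDs (illustrative only;
# real tokenizers split text into subword pieces, not whole words).
vocab = {"The": 0, "capital": 1, "of": 2, "France": 3, "is": 4, "Paris": 5}

# One row per token ID. In a trained model these rows are learned parameters;
# here they are arbitrary numbers with a hidden size of 4.
embedding_table = [
    [0.1, 0.0, -0.2, 0.3],   # ID 0: "The"
    [0.7, -0.1, 0.4, 0.0],   # ID 1: "capital"
    [0.0, 0.2, 0.1, -0.1],   # ID 2: "of"
    [0.5, 0.6, -0.3, 0.2],   # ID 3: "France"
    [-0.2, 0.1, 0.0, 0.4],   # ID 4: "is"
    [0.6, 0.5, -0.2, 0.1],   # ID 5: "Paris"
]

def embed(token_ids):
    """Use each token ID as a row address: the result is h^0, one vector per position."""
    return [embedding_table[t] for t in token_ids]

token_ids = [vocab[w] for w in "The capital of France is".split()]
h0 = embed(token_ids)

print(token_ids)            # [0, 1, 2, 3, 4]
print(len(h0), len(h0[0]))  # 5 4
```

Note that the ID is used only as an index. Nothing about the integer 4 tells the model anything about "is"; all of the usable structure lives in the row it points to.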
The Main Character: The Residual Stream
Once the model has embeddings, the central flow looks like this:
h^0 -> h^1 -> h^2 -> ... -> h^L
Each Transformer layer takes in a hidden state and produces an updated hidden state.
But a layer usually does not throw away the old state and build a new one from scratch. It computes an update and adds that update back into the existing stream:
h_next = h_current + update
That is the intuition behind the residual stream.
The residual stream is the model’s running workspace. Every layer reads from it. Every layer writes back into it. The hidden state at a position becomes a kind of living manuscript: each layer adds edits, annotations, retrieved context, and transformed features.
This matters because it changes how you think about the whole architecture.
The Transformer is not a pipeline where a token becomes Q/K/V, then becomes attention, then becomes an MLP output, then becomes a token again. The main continuity is the hidden state. Attention and MLPs are operations that update it.
LayerNorm and residual connections are support structures around this stream. Residual connections preserve the old state while adding new information. LayerNorm keeps the scale stable so the model can stack many layers without the stream numerically blowing up or collapsing.
So a better sketch of a layer is:
h
-> attention reads from h and writes an update
-> MLP reads from the updated h and writes its own update
-> new h
The hidden state is the story. The layers are revisions.
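The read-then-add pattern can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a pre-norm arrangement (normalize, compute an update, add it back), a LayerNorm without learned scale or shift, and two placeholder sublayers (`toy_attention`, `toy_mlp`) that stand in for the real ones. Real attention would also read other positions; here only the write-back pattern is shown for a single position.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]

def block(h, attention_update, mlp_update):
    """One layer for a single position: h is never replaced, only added to."""
    h = add(h, attention_update(layer_norm(h)))  # attention writes an update
    h = add(h, mlp_update(layer_norm(h)))        # MLP reads the updated h, writes again
    return h

# Hypothetical placeholder sublayers, chosen only so the example runs.
def toy_attention(x):
    return [0.1 * v for v in x]

def toy_mlp(x):
    return [0.5 * max(v, 0.0) for v in x]

def no_update(x):
    return [0.0] * len(x)

h = [1.0, -2.0, 0.5, 3.0]
h_next = block(h, toy_attention, toy_mlp)

# If both sublayers write nothing, the old state passes through unchanged.
print(block(h, no_update, no_update) == h)  # True
```

The last line is the point of the residual connection: a layer that has nothing useful to add can leave the stream alone, and whatever earlier layers wrote is still there.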
Attention: Communication Across Positions
Attention is where different token positions communicate.
Suppose the model is processing:
The capital of France is
The final position needs information from earlier positions. It needs to notice “capital” and “France” if it is going to predict “Paris”. Attention is the mechanism that lets one position pull information from other positions.
This is where Q, K, and V appear.
But the important point is that Q, K, and V are not separate things flowing through the model. They are temporary projections of the current hidden state.
At a given layer:
Q = h W_Q
K = h W_K
V = h W_V
These are three different views of the same underlying stream.
A useful intuition:
Q: what am I looking for?
K: what kind of information do I contain?
V: what information do I contribute if selected?
The query compares against keys to produce attention weights:
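attention scores = QK^T / sqrt(d_k)
attention weights = softmax(attention scores)
Here d_k is the dimension of the key vectors; dividing by sqrt(d_k) keeps the scores in a range where softmax stays well-behaved. Then those weights mix the values: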
attention output = attention weights x V
Conceptually:
Q looks at K
K determines relevance
V provides the content
The result is not “the next token.” The result is an update to the hidden state.
That update is written back into the residual stream, and the model continues.
So attention is best understood as communication across positions. It lets a token position gather context from other token positions. It is the model’s routing mechanism: what information should move from where to where?
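A single attention head can be sketched in plain Python. The sketch below makes several simplifying assumptions: one head, no output projection, no LayerNorm, the usual scaling by the square root of the key dimension, and a causal mask so that each position reads only from itself and earlier positions, as in decoder-only language models. The matrices are arbitrary identity matrices, chosen so the numbers are easy to follow.

```python
import math

def matvec(x, W):
    """Row vector x times matrix W, where W has len(x) rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(H, W_Q, W_K, W_V):
    """H holds one hidden state per position. Returns one update per position."""
    Q = [matvec(h, W_Q) for h in H]  # three projections of the same stream
    K = [matvec(h, W_K) for h in H]
    V = [matvec(h, W_V) for h in H]
    scale = math.sqrt(len(K[0]))
    updates, all_weights = [], []
    for i, q in enumerate(Q):
        # Causal mask: position i scores only positions 0..i.
        scores = [sum(a * b for a, b in zip(q, K[j])) / scale for j in range(i + 1)]
        weights = softmax(scores)
        # The update is a weighted mix of the value vectors.
        update = [sum(w * V[j][c] for j, w in enumerate(weights))
                  for c in range(len(V[0]))]
        updates.append(update)
        all_weights.append(weights)
    return updates, all_weights

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updates, weights = causal_attention(H, identity, identity, identity)

print(weights[0])  # [1.0]  (the first position can only read itself)
print(updates[0])  # [1.0, 0.0]
```

Each entry of `updates` has the same shape as a hidden state, which is what lets it be added straight back into the residual stream.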
MLP: Computation Inside Each Position
If attention is communication, the MLP is private computation.
Attention moves information across positions:
token position A can read from token position B
The MLP transforms information inside each position:
this position's h gets internally reshaped
This distinction is one of the most useful ways to understand a Transformer layer:
attention: horizontal communication across tokens
MLP: vertical transformation within features
After attention has gathered information from the context, the MLP reshapes the hidden state. It can strengthen some features, suppress others, create nonlinear combinations, and prepare the representation for the next layer.
For the phrase:
The capital of France is
attention may help the final position pull information from “capital” and “France”. The MLP can then transform that gathered information into features that point more strongly toward the continuation “Paris”.
This does not mean the MLP literally contains a clean dictionary entry like:
France capital -> Paris
Real models are distributed and messy. But one useful deeper intuition is that MLPs often behave like pattern-triggered memory. If the current hidden state matches some learned pattern, the MLP writes a corresponding feature back into the residual stream.
That is why it is too weak to describe the MLP as “just a feedforward network.” In the flow of the model, it is the place where each position privately computes on the information it has gathered.
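The "pattern-triggered memory" intuition can be made concrete with a deliberately tiny MLP. Everything here is hand-picked for illustration: the weights `W_in`, `b_in`, and `W_out` are not learned, the hidden layer has only two neurons, and ReLU is assumed as the nonlinearity (many models use other activations).

```python
def matvec(x, W):
    """Row vector x times matrix W, where W has len(x) rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def mlp(h, W_in, b_in, W_out):
    """Position-wise MLP: expand, apply ReLU, project back to the hidden size."""
    pre = [p + b for p, b in zip(matvec(h, W_in), b_in)]
    hidden = [max(p, 0.0) for p in pre]  # a neuron "fires" only if its pattern matches
    return matvec(hidden, W_out)         # firing neurons write features back

# Neuron 0 fires only when features 0 and 1 are both present (sum above 1.5).
# Neuron 1 is inert. All values are hand-picked, not learned.
W_in = [[1.0, 0.0],
        [1.0, 0.0],
        [0.0, 0.0]]
b_in = [-1.5, 0.0]
# When neuron 0 fires, it writes into feature 2 of the hidden state.
W_out = [[0.0, 0.0, 2.0],
         [0.0, 0.0, 0.0]]

h_match = [1.0, 1.0, 0.0]  # both trigger features present
h_miss = [1.0, 0.0, 0.0]   # only one trigger feature present

print(mlp(h_match, W_in, b_in, W_out))  # [0.0, 0.0, 1.0]
print(mlp(h_miss, W_in, b_in, W_out))   # [0.0, 0.0, 0.0]
```

The function takes a single position's vector and never sees its neighbors. Applying it to a sequence just means calling it once per position with the same weights, which is the sense in which the MLP is private computation.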
A Transformer block is therefore not just attention. It is a repeated alternation:
communicate across positions
compute inside each position
write the result back into h
From Hidden State to Logits
After the final Transformer layer, the model has a final hidden state:
h^L
But this final hidden state is still not a token. It is a vector.
To produce text, the model has to translate that vector back into token space. It must assign a score to every possible token in the vocabulary.
That is the job of the LM head, sometimes called the unembedding layer.
h^L -> LM head -> logits
If the vocabulary has 100,000 tokens, the logits vector has 100,000 numbers. Each number is the raw score for one candidate token.
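In its simplest form, the LM head is one matrix multiplication: each vocabulary token has a direction in hidden space, and its logit is the dot product of the final hidden state with that direction. The sketch below assumes a bias-free head and uses made-up numbers for both `unembed` and `h_final`.

```python
vocab = ["Paris", "London", "apple", ".", "runs"]

# One direction per vocabulary token (hypothetical values, hidden size 3).
unembed = [
    [3.0, 1.0, 0.0],    # "Paris"
    [1.0, 0.5, 0.0],    # "London"
    [-1.0, 0.0, 1.0],   # "apple"
    [0.0, 1.0, 0.0],    # "."
    [-0.5, -0.5, 2.0],  # "runs"
]

def lm_head(h_final, unembed):
    """Score every token: dot product of the hidden state with each token direction."""
    return [sum(a * b for a, b in zip(h_final, row)) for row in unembed]

h_final = [2.0, 1.0, 0.0]  # made-up final hidden state for the last position
logits = lm_head(h_final, unembed)

for token, score in zip(vocab, logits):
    print(token, score)
# Paris 7.0
# London 2.5
# apple -2.0
# . 1.0
# runs -1.5
```

The output always has exactly one number per vocabulary entry, regardless of the hidden size.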
Imagine a tiny vocabulary:
"Paris"
"London"
"apple"
"."
"runs"
Given the context:
The capital of France is
the model might produce logits like:
Paris: 12.0
London: 4.1
apple: -2.3
.: 1.0
runs: -1.5
These numbers are not probabilities. A logit of 12.0 does not mean 12 percent.
Logits are raw relative preference scores.
Their meaning lives in the differences. If “Paris” has logit 12.0 and “London” has logit 11.5, both are strong and close. If “Paris” has logit 12.0 and “London” has logit 4.1, the model strongly prefers “Paris”.
Softmax turns these raw scores into a probability distribution:
probability(token_i) = exp(logit_i) / sum(exp(all logits))
After softmax, the same example looks approximately like:
Paris: 0.9996
London: 0.0004
apple: 0.0000
.: 0.0000
runs: 0.0000
Then a decoding strategy chooses the next token.
If the model uses greedy decoding, it picks the highest-probability token. If it uses sampling, temperature, top-k, or top-p, it may choose a likely token without always choosing the maximum.
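The softmax step and a few decoding choices can be checked directly with the example logits. The helper names (`softmax`, `top_k_filter`) and the fixed random seed are choices made for this sketch, not part of any particular library.

```python
import math
import random

logits = {"Paris": 12.0, "London": 4.1, "apple": -2.3, ".": 1.0, "runs": -1.5}
tokens = list(logits)
values = list(logits.values())

def softmax(xs, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(xs, k):
    """Keep the k largest logits; the rest become -inf, so softmax gives them 0."""
    cutoff = sorted(xs, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in xs]

probs = softmax(values)
print(round(probs[0], 4))  # 0.9996  (Paris)
print(round(probs[1], 4))  # 0.0004  (London)

# Greedy decoding: always take the most probable token.
greedy = tokens[probs.index(max(probs))]
print(greedy)  # Paris

# Sampling with temperature 2.0: the distribution is flatter, so
# lower-ranked tokens are chosen somewhat more often.
warm = softmax(values, temperature=2.0)
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = rng.choices(tokens, weights=warm, k=1)[0]

# Top-k with k=2: only "Paris" and "London" can ever be sampled.
top2 = softmax(top_k_filter(values, 2))
```

Only the differences between logits matter here: adding the same constant to every logit leaves `probs` unchanged, which is why a logit of 12.0 has no meaning on its own.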
The important chain is:
final hidden state -> token scores -> probabilities -> selected token ID
The LM head is the output-side vocabulary map. It asks: given this final hidden state, which token directions does it point toward?
Generation Is a Loop
A language model generates text one token at a time.
After reading:
The capital of France is
the model produces a probability distribution and selects something like:
Paris
At that moment, the generated token is just a token ID. It does not come with Q, K, and V already attached.
To continue generation, the model appends that token ID to the context:
The capital of France is Paris
Then the new token is embedded:
token ID -> embedding -> h^0
It passes through the Transformer layers:
h^0 -> h^1 -> ... -> h^L
At each layer, the current hidden state produces fresh Q, K, and V for that layer:
Q^l = h^{l-1} W_Q^l
K^l = h^{l-1} W_K^l
V^l = h^{l-1} W_V^l
Then the final hidden state produces logits, softmax gives a distribution, and the model selects the next token again.
So generation is a loop:
context tokens
-> hidden states
-> final hidden state
-> logits
-> next token ID
-> append token
-> repeat
The generated token becomes input for the next step. It is not a complete thought object by itself. It is an address that gets turned into a vector, then rewritten layer by layer into a context-aware hidden state.
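The loop itself is short once the forward pass is treated as a black box. In the sketch below, `toy_logits` is a hypothetical stand-in for the whole embed, layers, LM head pipeline: it looks only at the last token and consults a hand-written table. The `<eos>` stop token and greedy selection are also assumptions of this sketch.

```python
vocab = ["The", "capital", "of", "France", "is", "Paris", ".", "<eos>"]
token_id = {t: i for i, t in enumerate(vocab)}

# Hand-written continuation table used by the placeholder model below.
NEXT = {"is": "Paris", "Paris": ".", ".": "<eos>"}

def toy_logits(context_ids):
    """Placeholder for embed -> layers -> LM head: one score per vocabulary token."""
    last = vocab[context_ids[-1]]
    target = NEXT.get(last, "<eos>")
    return [5.0 if t == target else 0.0 for t in vocab]

def generate(prompt_ids, max_new_tokens=10):
    """Autoregressive loop: score, pick, append, repeat."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(context)  # forward pass over the current context
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if vocab[next_id] == "<eos>":
            break
        context.append(next_id)  # the output ID becomes input for the next step
    return context

prompt = [token_id[t] for t in ["The", "capital", "of", "France", "is"]]
out = generate(prompt)
print(" ".join(vocab[i] for i in out))  # The capital of France is Paris .
```

The only thing carried from one iteration to the next is the list of token IDs. Every hidden state, query, key, and value is derived again from that list inside the forward pass.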
Engineering Aside: KV Cache
KV cache is important in real inference, but it is an engineering optimization, not a different model.
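Without a cache, each time the model generates a new token, it would have to recompute the keys and values for every previous token at every layer.
But during autoregressive generation, previous tokens do not change. Because each position attends only to itself and earlier positions, appending a new token cannot alter the hidden states of the tokens before it, so their keys and values at each layer can be reused.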
So each layer stores previous K and V vectors:
Layer 1 cache: K/V for previous tokens
Layer 2 cache: K/V for previous tokens
...
Layer L cache: K/V for previous tokens
When a new token arrives, it computes its own Q, K, and V at each layer. Its new K and V are appended to that layer's cache, and its Q reads from the cached K/V for the previous tokens together with its own. The appended entries then stay in the cache for future tokens.
The reason Q is usually not cached is simple:
K/V are memory.
Q is the current query.
Future tokens need the past K/V so they can read from the past. They do not need the past Q.
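A single-layer, single-head cache can be sketched to show that caching changes the amount of work but not the result. The class name `LayerKVCache`, the projection matrices, and the input vectors are all invented for this example; a real model keeps one such cache per layer (and per head).

```python
import math

def matvec(x, W):
    """Row vector x times matrix W, where W has len(x) rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """One query reads from a list of keys and mixes the matching values."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

class LayerKVCache:
    """Hypothetical per-layer cache holding K and V for every token seen so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, h_new, W_Q, W_K, W_V):
        q = matvec(h_new, W_Q)                    # used once, never stored
        self.keys.append(matvec(h_new, W_K))      # stored for future tokens
        self.values.append(matvec(h_new, W_V))
        return attend(q, self.keys, self.values)  # reads past K/V plus its own

W_Q = [[1.0, 0.5], [0.0, 1.0]]
W_K = [[0.5, 0.0], [1.0, 1.0]]
W_V = [[1.0, 0.0], [0.0, 2.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # this layer's inputs for three tokens

# Incremental path: one token at a time, projecting only the new token.
cache = LayerKVCache()
for h in H:
    cached_out = cache.step(h, W_Q, W_K, W_V)

# Full recompute for the last position: project K and V for every token again.
full_out = attend(matvec(H[-1], W_Q),
                  [matvec(h, W_K) for h in H],
                  [matvec(h, W_V) for h in H])

print(all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_out)))  # True
```

With the cache, each step projects one new vector instead of re-projecting the whole context, while the attention output for the newest position is the same either way.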
KV cache changes how much work must be recomputed. It does not change the conceptual loop:
new token ID -> hidden state -> Q/K/V at each layer -> updated hidden state -> logits -> next token
The Whole Model in One Picture
Here is the compact version:
token ID
|
v
embedding = h^0
|
v
Layer 1:
h -> Q/K/V -> attention update -> MLP update -> h^1
|
v
Layer 2:
h -> Q/K/V -> attention update -> MLP update -> h^2
|
v
...
|
v
final h^L
|
v
LM head
|
v
logits
|
v
softmax
|
v
next token distribution
|
v
selected next token ID
And here is the same idea as a single sentence:
The Transformer repeatedly rewrites hidden states until the final hidden state points toward the next token.
That is the core mechanism.
Tokens are the input and output interface. Q, K, and V are temporary views used for attention. MLPs transform each position’s hidden state. The residual stream carries the computation forward. The LM head translates the final hidden state into token scores.
If you keep one picture, keep this one:
token IDs enter
hidden states flow
layers write updates
logits score the vocabulary
one token is chosen
the loop begins again
The hidden state is the story.

