What Normalization Really Removes
Shift, scale, and the unstable coordinate systems inside neural networks
Normalization is usually introduced as a formula:
x_normalized = (x - mean) / std
The formula is correct, but it hides the idea.
Normalization is not mainly about making numbers “well-behaved.” It is about removing variations the model should not have to care about.
Mean removes shift. Standard deviation removes scale.
Mean defines the origin. Standard deviation defines the unit. After normalization, the next layer no longer reads raw values in an arbitrary coordinate system. It reads relative structure.
That is the point.
Numbers are ambiguous
Consider three vectors:
[1, 2, 3]
[101, 102, 103]
[10, 20, 30]
As raw numbers, they are different.
But structurally, they share the same pattern:
low -> middle -> high
The second vector is the first one shifted by 100. The third is the first one scaled by 10.
If a downstream layer sees only raw values, it has to rediscover the same pattern under different origins and different units. In a deep network, this becomes more than inefficient. Each layer keeps changing the coordinate system seen by the next layer.
Normalization says: before interpreting the values, put them into their local coordinate system.
Subtracting the mean moves the center to zero.
Dividing by the standard deviation changes the unit from “raw value” to “local fluctuation size.”
So a normalized value no longer means:
this number is 120
It means:
this value is 1.22 local units above its center
That is a more stable thing for the next layer to read.
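A minimal sketch of that idea, using only the Python standard library: the three vectors above collapse to the same normalized pattern. The helper name `normalize` is an illustrative choice, and the population standard deviation (`statistics.pstdev`) is assumed because it matches the numbers used in this article.

```python
from statistics import fmean, pstdev


def normalize(values):
    """Subtract the mean (origin), then divide by the population std (unit)."""
    center = fmean(values)
    unit = pstdev(values)  # assumes the values are not all identical
    return [(v - center) / unit for v in values]


base = normalize([1, 2, 3])
shifted = normalize([101, 102, 103])  # same pattern, origin moved by 100
scaled = normalize([10, 20, 30])      # same pattern, unit multiplied by 10

for name, vec in [("base", base), ("shifted", shifted), ("scaled", scaled)]:
    print(name, [round(v, 2) for v in vec])
```

All three lines print the same rounded vector, [-1.22, 0.0, 1.22], which is the "low -> middle -> high" pattern expressed in local units.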
Standard deviation is the local unit
Take two vectors:
A = [8, 10, 12]
B = [80, 100, 120]
For A:
mean = 10
deviations = [-2, 0, 2]
std ≈ 1.63
For B:
mean = 100
deviations = [-20, 0, 20]
std ≈ 16.33
B’s deviations are ten times larger, but its standard deviation is also ten times larger. After normalization, both become approximately:
[-1.22, 0, 1.22]
12 and 120 are not the same in raw space. But inside their own local coordinate systems, they play the same role.
Both are high by the same relative amount.
Normalization does not preserve absolute magnitude. It preserves relative structure.
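The same cancellation can be written out in one step. As a sketch, assume y is any shifted and positively rescaled copy of x. The symbols a, b, μ, and σ are introduced here for illustration and are not used elsewhere in this article; μ is the mean and σ the standard deviation.

```latex
\[
y_i = a\,x_i + b,\quad a > 0
\;\Longrightarrow\;
\mu_y = a\,\mu_x + b,\qquad \sigma_y = a\,\sigma_x,
\]
\[
\frac{y_i - \mu_y}{\sigma_y}
= \frac{a\,x_i + b - a\,\mu_x - b}{a\,\sigma_x}
= \frac{x_i - \mu_x}{\sigma_x}.
\]
```

For A and B above, a = 10 and b = 0, which is why both vectors normalize to the same result.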
Who shares the ruler?
A better way to understand normalization is to ask:
Which numbers should share the same ruler?
Every normalization method answers this differently.
BatchNorm says the same channel, across examples and spatial positions, can share a ruler.
LayerNorm says the dimensions inside one sample, or one token representation, can share a ruler.
GroupNorm says a group of channels inside one sample can share a ruler.
InstanceNorm says one image, one channel, across spatial positions, can share a ruler.
So normalization is not just a numerical operation. It is a modeling assumption about comparability.
If the selected group of numbers is meaningfully comparable, normalization helps. If not, it can erase useful information.
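One way to make "who shares the ruler" concrete: for an activation tensor of shape (N, C, H, W), the four methods differ mainly in which axes the mean and standard deviation are computed over. The sketch below uses NumPy and leaves out the learned scale and shift (and BatchNorm's running statistics) that real layers carry. The helper `standardize`, the tensor sizes, and the group count are illustrative assumptions. The LayerNorm line shows the image-style variant over (C, H, W); for a token representation the ruler would be the hidden dimension alone.

```python
import numpy as np


def standardize(x, axes, eps=1e-5):
    """Normalize x with a mean and std shared over the given axes.

    Adding eps to the std is a simplification of what real layers do.
    """
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + eps)


rng = np.random.default_rng(0)
N, C, H, W = 4, 6, 5, 5  # batch, channels, height, width (arbitrary sizes)
x = rng.normal(loc=3.0, scale=2.0, size=(N, C, H, W))

# BatchNorm: one ruler per channel, shared across examples and positions.
batch_norm = standardize(x, axes=(0, 2, 3))

# LayerNorm (image-style): one ruler per example, shared across C, H, W.
layer_norm = standardize(x, axes=(1, 2, 3))

# InstanceNorm: one ruler per (example, channel), shared across positions.
instance_norm = standardize(x, axes=(2, 3))

# GroupNorm: split channels into groups, one ruler per (example, group).
groups = 3
grouped = x.reshape(N, groups, C // groups, H, W)
group_norm = standardize(grouped, axes=(2, 3, 4)).reshape(N, C, H, W)

# Each result has roughly zero mean over exactly the axes that share a ruler.
print(np.abs(batch_norm.mean(axis=(0, 2, 3))).max())
print(np.abs(instance_norm.mean(axis=(2, 3))).max())
```

The arithmetic is identical in every case. Only the choice of axes, and therefore the comparability assumption, changes.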
Why BatchNorm came first
From today’s Transformer-centered world, LayerNorm feels obvious. Each token has a hidden vector. Normalize that vector. Done.
So why did BatchNorm come first?
Because in 2015, when BatchNorm was introduced, the dominant architecture was the convolutional neural network.
In a CNN, a channel behaves like a detector:
channel 1: edge detector
channel 2: texture detector
channel 3: color contrast detector
BatchNorm normalizes each channel across the batch and spatial locations. For one channel, it asks:
How strongly is this detector firing across many images and positions?
That is a natural ruler. An edge detector firing on cats, cars, and houses is still an edge detector. Its responses can reasonably be compared.
BatchNorm came first not because it was more fundamental, but because it fit the architecture that mattered most at the time.
The architecture made the intuition obvious.
Why LayerNorm fits Transformers
Transformers have a different basic object.
A CNN is organized around channels as detectors.
A Transformer is organized around token states.
Each token has a hidden vector:
token -> hidden state
That hidden state represents the token’s current contextual state. In language, it is not very natural to normalize one hidden dimension across unrelated tokens in unrelated sentences.
A batch might contain:
"The bank raised rates."
"I sat by the river bank."
"Because ..."
"<padding>"
Across batch and token positions, the same hidden dimension may mix different meanings, positions, contexts, and padding artifacts.
LayerNorm asks a more local question:
For this token, in this layer, what is the internal scale of its hidden state?
That is why LayerNorm fits Transformers. It does not depend on other examples in the batch, so it behaves the same way during training and during autoregressive generation.
Each token gets its own ruler.
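A small sketch of that batch independence, in plain Python. The token vectors are made-up numbers, and `per_feature_norm` is a hypothetical stand-in for statistics shared across tokens in a batch; neither comes from a real model.

```python
from statistics import fmean, pstdev


def layer_norm(vec, eps=1e-5):
    """Normalize one token's hidden vector with its own mean and std."""
    mu, sigma = fmean(vec), pstdev(vec)
    return [(v - mu) / (sigma + eps) for v in vec]


def per_feature_norm(tokens, eps=1e-5):
    """Normalize each hidden dimension across all tokens in the batch."""
    columns = list(zip(*tokens))
    stats = [(fmean(col), pstdev(col)) for col in columns]
    return [
        [(v - mu) / (sigma + eps) for v, (mu, sigma) in zip(tok, stats)]
        for tok in tokens
    ]


token = [0.5, -1.0, 2.0, 0.1]  # the token we care about
batch_a = [token, [1.0, 0.0, -1.0, 0.3], [0.2, 0.4, 0.1, -0.6]]
# Same token, different neighbours: a padding-like row and an outlier row.
batch_b = [token, [0.0, 0.0, 0.0, 0.0], [9.0, -7.0, 5.0, 3.0]]

# LayerNorm only reads the token itself, so its neighbours do not matter.
ln_a = layer_norm(batch_a[0])
ln_b = layer_norm(batch_b[0])
print(ln_a == ln_b)

# Statistics shared across the batch change when the neighbours change.
pf_a = per_feature_norm(batch_a)[0]
pf_b = per_feature_norm(batch_b)[0]
print([round(v, 2) for v in pf_a])
print([round(v, 2) for v in pf_b])
```

The LayerNorm outputs are identical in both batches, while the shared-statistics outputs for the same token differ depending on what else happens to be in the batch.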
The residual stream problem
LayerNorm matters even more because Transformers are residual networks.
A Transformer layer roughly does this:
x_next = x + update
Layer after layer, information is added into the residual stream.
Without normalization, each layer can change the scale of that stream. Worse, the update produced by a layer often depends on the scale of its input.
That creates a feedback loop:
large x -> large update -> even larger x
or the opposite:
small x -> small update -> weaker future updates
Across dozens or hundreds of layers, small scale errors compound.
Pre-LayerNorm changes the read-write pattern:
x_next = x + F(LayerNorm(x))
LayerNorm does not erase the residual stream. The stream can still accumulate information.
What LayerNorm does is make sure each sublayer reads from a calibrated version of the stream before writing back into it.
Without LayerNorm:
big x -> big update -> bigger x
With Pre-LayerNorm:
big x -> normalized x -> controlled update
LayerNorm is not just cleaning up activations. It stabilizes the interface between reading from and writing to the residual stream.
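The feedback loop can be reproduced in a toy setting. The sketch below is not a Transformer: `toy_sublayer` is a hypothetical stand-in for F whose output simply scales with whatever it reads, and the starting vector and layer count are arbitrary. It only illustrates how the read-write pattern changes the growth of the stream.

```python
import math
from statistics import fmean, pstdev


def layer_norm(vec, eps=1e-5):
    """Normalize a vector with its own mean and std (no learned parameters)."""
    mu, sigma = fmean(vec), pstdev(vec)
    return [(v - mu) / (sigma + eps) for v in vec]


def toy_sublayer(vec):
    """Hypothetical F: the update is proportional to the input it reads."""
    return [0.5 * v for v in vec]


def l2_norm(vec):
    """Euclidean length of the vector."""
    return math.sqrt(sum(v * v for v in vec))


def run(x, layers, pre_norm):
    """Apply residual updates and record the stream length after each layer."""
    norms = [l2_norm(x)]
    for _ in range(layers):
        read = layer_norm(x) if pre_norm else x
        x = [xi + ui for xi, ui in zip(x, toy_sublayer(read))]
        norms.append(l2_norm(x))
    return norms


x0 = [1.0, 2.0, 3.0, 4.0]
plain = run(x0, layers=24, pre_norm=False)
pre_ln = run(x0, layers=24, pre_norm=True)

print(f"no norm: {plain[0]:.1f} -> {plain[-1]:.1f}")
print(f"pre-LN:  {pre_ln[0]:.1f} -> {pre_ln[-1]:.1f}")
```

In this toy, the unnormalized stream grows geometrically, by a factor of 1.5 per layer. With Pre-LayerNorm, every update has a bounded length, so the stream still accumulates but grows at most linearly.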
RMSNorm shows what matters most
RMSNorm is interesting because it removes part of LayerNorm.
LayerNorm does two things:
x - mean removes shift
/ std removes scale
RMSNorm keeps mostly the second part:
x / rms(x)
Here rms(x) is the root mean square of the vector’s entries, sqrt(mean(x^2)). RMSNorm does not center the vector around zero. It mainly controls the vector’s length.
That is a clue.
If the whole point of normalization were “make every vector zero-mean and unit-variance,” RMSNorm would look like a broken version of LayerNorm. But many modern language models use it successfully.
So what does RMSNorm reveal?
It suggests that, in large Transformers, the most dangerous instability is often not shift. It is scale.
The residual stream keeps accumulating information across layers. If its scale grows, attention logits, MLP activations, and residual updates become harder to control. A layer may produce a larger update simply because it read a larger vector, not because it found a more meaningful pattern.
RMSNorm attacks that problem directly.
It says:
Keep the direction of the representation, but control its length.
That is a smaller intervention than LayerNorm. It does not force the hidden state to be centered. It just prevents the next sublayer from reading the residual stream at an arbitrary volume.
RMSNorm matters because it separates two ideas that LayerNorm bundles together.
Mean subtraction removes shift.
Scale normalization controls length.
Modern Transformers suggest that controlling length may be the load-bearing part.
That does not make LayerNorm wrong. It makes the principle clearer: normalization is not a ritual. It is a targeted fix for the coordinate instability your architecture actually suffers from.
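A minimal sketch of the x / rms(x) operation in plain Python, without the learned gain that practical implementations usually add. The input vector and the `eps` value are arbitrary illustrative choices. The checks show what RMSNorm keeps (direction), what it controls (length), and what it leaves alone (shift).

```python
import math
from statistics import fmean


def rms_norm(vec, eps=1e-8):
    """Divide by the root mean square; no mean subtraction, no learned gain."""
    rms = math.sqrt(fmean([v * v for v in vec]))
    return [v / (rms + eps) for v in vec]


x = [3.0, 4.0, 5.0, 8.0]
out = rms_norm(x)

# Direction is preserved: every component is divided by the same number.
ratios = [o / v for o, v in zip(out, x)]
print([round(r, 4) for r in ratios])

# Length is controlled: the output has an RMS close to 1.
print(round(math.sqrt(fmean([o * o for o in out])), 4))

# Shift is not removed: the mean of the output stays away from zero.
print(round(fmean(out), 4))

# Rescaling the input changes nothing; shifting it does.
scaled = rms_norm([10 * v for v in x])
shifted = rms_norm([v + 100 for v in x])
print(all(abs(a - b) < 1e-6 for a, b in zip(out, scaled)))
print(all(abs(a - b) < 1e-6 for a, b in zip(out, shifted)))
```

The last two checks print True and then False: RMSNorm is invariant to scale but not to shift, which is exactly the part of LayerNorm it chose to keep.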
The core insight
Neural networks do not merely process values. They process values inside coordinate systems.
If the coordinate system keeps shifting and scaling, the next layer has to solve two problems at once:
What is the pattern?
What coordinate system is it expressed in today?
Normalization removes part of the second problem.
Mean sets the origin.
Standard deviation sets the unit.
LayerNorm chooses the token’s own hidden state as the local ruler.
That is the real intuition:
Normalization removes coordinate noise so the model can read structure.
LayerNorm works in Transformers because each token needs to be read in its own local coordinate system.
