Backpropagation as Sensitivity Accounting
The Programmer’s Intuition
Most explanations of backpropagation begin with the chain rule. You are shown a network of circles, a few partial derivatives, and suddenly you are drowning in ∂L/∂w.
That is mathematically correct, but cognitively premature.
When you actually implement a neural network, the first useful question is not “what is the derivative?” but rather: “what does .grad actually mean in the code?”
If you can understand what that one field stores, the entire mechanism of deep learning demystifies itself. Backpropagation functions as a bookkeeping system rather than a mysterious calculus ritual.
1. The Meaning of .grad
Imagine a simple computation:
x = 5
loss = x * x # loss is 25What is x.grad?
Forget the value of loss. Forget the value of x.
x.grad answers one very specific question: If I stand at the current value of x, and nudge x slightly to the right, how violently does the loss react?
Let’s test it. If we nudge x by 0.01 (from 5 to 5.01), the new loss is 5.01 * 5.01 = 25.1001.
The loss changed by 0.1001.
The ratio of that change is:
loss_change / x_change = 0.1001 / 0.01 ≈ 10That ratio—10—is the gradient. It means that right now, at this exact spot, any movement in x is amplified by roughly 10 times in the final loss.
A gradient is just the steepness of the slope under your feet. It is a local sensitivity ratio.
2. The Naive Way: The “Nudge” Method
If a gradient is just a sensitivity ratio, why do we need backpropagation at all?
Why not just write a for loop? For every weight in our neural network, we could add 0.01, run the network forward, measure the change in loss, and divide by 0.01.
# The Naive Method
for w in weights:
original_loss = forward(weights)
w += 0.01
new_loss = forward(weights)
w.grad = (new_loss - original_loss) / 0.01
w -= 0.01 # resetThis works perfectly. It is mathematically sound and incredibly intuitive.
But it has a fatal flaw.
If your neural network has 1 parameter, you run the forward pass 1 time.
If your neural network has 1 billion parameters, you have to run the forward pass 1 billion times just to calculate the gradients for a single training step.
The naive method scales with the number of inputs. In deep learning, inputs (parameters) are massive, but the output (the loss) is always just a single number.
We need a way to calculate the sensitivity contribution of 1 billion parameters by running the network only once.
3. The Solution: Reverse-Mode Accounting
This is why backpropagation goes backward.
Instead of asking 1 billion times: “How does this specific weight affect the loss?”
We start at the loss and ask once: “Which earlier values can change me, and by how much?”
To do this, we need two things:
A Forward Pass to freeze the world. You can’t know the slope under your feet until you know where you are standing. The forward pass computes the data and builds a graph of who called whom.
Local Rules. Every basic operation (
+,*,relu) must know its own tiny sensitivity ratio.
For example, if z = a * b:
How sensitive is
ztoa? The ratio isb. (Ifz = a * 10, nudgingaby 1 changeszby 10. The sensitivity is just the other multiplier).How sensitive is
ztob? The ratio isa.
The operation doesn’t know anything about the neural network. It only knows its local rule.
4. The Core Mechanism: input.grad += local_grad * output.grad
Backpropagation is just walking the frozen graph backward, asking each operation to convert the sensitivity of its output into the sensitivity of its inputs.
Suppose we have a chain: x -> a -> loss.
And we are currently looking at the operation that produced a.
The downstream nodes have already figured out a.grad (how sensitive the final loss is to a).
Now, we need to find x.grad.
The logic is beautifully simple:
Sensitivity of loss to x = (Sensitivity of a to x) * (Sensitivity of loss to a)In code, this is:
x.grad += local_grad * a.gradFor a concrete chain:
a = x * 3
loss = a * aIf x = 2, then a = 6 and loss = 36.
The backward pass starts with loss.grad = 1.
Since loss = a * a, the local sensitivity of loss to a is 2 * a, so a.grad = 12.
Then a = x * 3, so the local sensitivity of a to x is 3, and x.grad = 3 * 12 = 36.
Why += and not =?
Because a single variable might influence the loss through multiple paths.
Consider y = x + x.
During backpropagation, the addition operation will say: “The left x contributed to y, and the right x contributed to y.”
If we used =, the second path would overwrite the first. By using +=, we accumulate the total gradient contribution of x across all the paths it took to reach the loss.
5. Learning is Stepping Downhill
Once the backward pass reaches the beginning, every single parameter in the network has a .grad value.
We now have a complete ledger. We know exactly how a tiny change in any of the 1 billion parameters will affect the final loss.
If w.grad is positive, increasing w increases the loss.
If w.grad is negative, increasing w decreases the loss.
Since our goal is to minimize the loss, we simply nudge every parameter in the opposite direction of its gradient.
for w in parameters:
w.data -= learning_rate * w.grad
w.grad = 0 # Clear the ledger for the next roundThe learning_rate just decides how big of a step we dare to take.
The Blindspot: What about Matrices?
If you look at real PyTorch code, you won’t see scalars like x = 5. You will see massive tensors and matrix multiplications.
Do not let matrices intimidate you. A matrix multiplication is just thousands of scalar multiplications and additions happening in parallel. The sensitivity accounting remains exactly the same: every single number in that matrix gets its own .grad, calculated by the exact same local rules, accumulated via +=.
Summary
Backpropagation acts as a ledger for sensitivity accounting.
Forward Pass: Compute the values and record the graph.
Local Rules: Every operation knows how its output reacts to its inputs.
Backward Pass: Start at the loss (
loss.grad = 1). Walk backward.Accumulate:
input.grad += local_grad * output.grad.Update: Move parameters against their gradient (
w -= lr * w.grad).
It is a bookkeeping trick that allows us to find the sensitivity of a billion parameters in a single sweep. And that trick is the engine of modern AI.

