Layer 1: FullyConnected(2 → 2)
|      | h1   | h2    |
|------|------|-------|
| x1   | 0.10 | -0.20 |
| x2   | 0.40 | 0.20  |

|      | h1   | h2   |
|------|------|------|
| bias | 0.00 | 0.00 |
W1, b1 map x → z1 → h (sigmoid)
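A minimal NumPy sketch of Layer 1, using the weight and bias values from the tables above (the variable and function names here are mine, not necessarily those in NN.py):

```python
import numpy as np

# Layer 1 parameters from the tables above: rows index inputs (x1, x2),
# columns index hidden neurons (h1, h2).
W1 = np.array([[0.10, -0.20],
               [0.40,  0.20]])
b1 = np.array([[0.00, 0.00]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward through Layer 1 for one sample x with shape (1, 2).
x = np.array([[0.0, 0.0]])   # placeholder sample
z1 = x @ W1 + b1             # pre-activation, shape (1, 2)
h = sigmoid(z1)              # hidden activation, shape (1, 2)
```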
Layer 2: FullyConnected(2 → 2)
|      | y1   | y2    |
|------|------|-------|
| h1   | 0.30 | -0.10 |
| h2   | 0.20 | 0.20  |

|      | y1   | y2   |
|------|------|------|
| bias | 0.00 | 0.00 |
W2, b2 map h → z2 (linear output)
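Continuing the same sketch, Layer 2 adds a plain linear map on top of h (no activation on the output):

```python
# Layer 2 parameters from the tables above: rows index hidden units (h1, h2),
# columns index outputs (y1, y2).
W2 = np.array([[0.30, -0.10],
               [0.20,  0.20]])
b2 = np.array([[0.00, 0.00]])

# Full forward pass: x -> z1 -> h -> z2 (linear output).
z2 = h @ W2 + b2             # network output, shape (1, 2)
```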
Computational Graph
Edges are labeled with weights; biases are shown as separate nodes feeding each neuron.
dE/dz2 = mse_prime(y_true, z2) = z2 - y_true
Computing this also refreshes the y_true and z2 values shown in the table below.
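A sketch of the loss and its derivative as used above; I keep the plain z2 - y_true form from the formula rather than adding a 2/n scaling factor, so the exact scaling may differ from NN.py:

```python
def mse(y_true, y_pred):
    # Mean squared error over the output vector.
    return np.mean((y_pred - y_true) ** 2)

def mse_prime(y_true, y_pred):
    # Gradient of the error w.r.t. the prediction, matching dE/dz2 = z2 - y_true above.
    return y_pred - y_true
```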
| sample | x1 | x2 | z1 | h (activation) | y_true (target) | z2 (output) | dE/dz2 |
|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | [0, 0] | [0, 0] | [1, 0] | [0.28496, 0.13220] | [-0.71504, 0.13220] |
| 1 | 0.00 | 0.00 | [0, 0] | [0, 0] | [1, 0] | [0.30508, 0.13681] | [-0.69492, 0.13681] |
| sample | Layer 1 input dE/dx | Layer 1 dE/dW (W1) | Layer 1 dE/db (b1) | Activation dE/dz1 | Layer 2 input dE/dx (dE/dh) | Layer 2 dE/dW (W2) | Layer 2 dE/db (b2) |
|---|---|---|---|---|---|---|---|
| 0 | [0,0] | [[0,0],[0,0]] | [0,0] | [0,0] | [0,0] | [[0,0],[0,0]] | [0,0] |
| 1 | [0,0] | [[0,0],[0,0]] | [0,0] | [0,0] | [0,0] | [[0,0],[0,0]] | [0,0] |
Backprop Cheat Sheet
Row-vector convention (1x2) matching the Python shapes.
z2 = h @ W2 + b2
dE/dz2 = z2 - y
dE/dW2 = h^T @ dE/dz2 (a 2x1 @ 1x2 product, giving a 2x2 gradient with the same shape as W2; Kneusel calls this weights_error in NN.py)
dE/dh = dE/dz2 @ W2^T (Kneusel calls this the input error in NN.py; the layer returns it so it can be passed backward as the output error of the previous layer)
dE/dz1 = dE/dh * sigmoid'(z1), where sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))
dE/dW1 = x^T @ dE/dz1
dE/dx = dE/dz1 @ W1^T
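The cheat sheet translated into NumPy for a single sample (a sketch reusing x, z1, h, z2, W1, W2, and sigmoid from the snippets above; y_true and the gradient names are my own):

```python
y_true = np.array([[1.0, 0.0]])          # example target for this sample

dE_dz2 = z2 - y_true                     # output error (mse_prime)
dE_dW2 = h.T @ dE_dz2                    # (2,1) @ (1,2) -> (2,2), same shape as W2
dE_db2 = dE_dz2                          # (1,2), same shape as b2
dE_dh  = dE_dz2 @ W2.T                   # error w.r.t. the hidden activation
dE_dz1 = dE_dh * sigmoid(z1) * (1 - sigmoid(z1))   # through sigmoid'
dE_dW1 = x.T @ dE_dz1                    # (2,1) @ (1,2) -> (2,2), same shape as W1
dE_db1 = dE_dz1                          # (1,2), same shape as b1
dE_dx  = dE_dz1 @ W1.T                   # error w.r.t. the network input
```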
Passing errors through the network
To pass the error back through the model, each layer computes how the error changes with respect to its input from how the error changes with respect to its output; the sketch after the list below shows this chain in code.
In this network, the order would be
- dE/dx <- dE/dz1 <- dE/dh <- dE/dz2
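One way to see this chain concretely is the layer-object pattern the cheat sheet hints at: each layer's backward takes the error w.r.t. its output and returns the error w.r.t. its input. The classes below are my own minimal sketch in that spirit, not the actual NN.py code; they reuse W1, b1, W2, b2, x, y_true, and mse_prime from the earlier snippets.

```python
class FullyConnected:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, output_error, lr):
        weights_error = self.x.T @ output_error    # dE/dW for this layer
        input_error = output_error @ self.W.T      # dE/d(input), returned backward
        self.W -= lr * weights_error               # SGD update
        self.b -= lr * output_error                # dE/db = output_error
        return input_error

class Sigmoid:
    def forward(self, z):
        self.h = 1.0 / (1.0 + np.exp(-z))
        return self.h

    def backward(self, output_error, lr):
        # dE/dz = dE/dh * sigmoid'(z), with sigmoid'(z) = h * (1 - h)
        return output_error * self.h * (1.0 - self.h)

# The error flows dE/dz2 -> dE/dh -> dE/dz1 -> dE/dx
# (copies so the earlier W1/b1/W2/b2 arrays stay untouched).
layer1 = FullyConnected(W1.copy(), b1.copy())
act = Sigmoid()
layer2 = FullyConnected(W2.copy(), b2.copy())

out = layer2.forward(act.forward(layer1.forward(x)))
err = mse_prime(y_true, out)      # dE/dz2
err = layer2.backward(err, 0.1)   # dE/dh
err = act.backward(err, 0.1)      # dE/dz1
err = layer1.backward(err, 0.1)   # dE/dx
```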
What about for W and b?
Recall that while the above tells us how to pass the error term backward, the goal of backprop is to calculate how changes to W and b affect the error.
Why is dE/db = dE/dz? dE/db = dE/dz * dz/db = dE/dz * d(xW + b)/db = dE/dz * (0 + 1) = dE/dz
Why is dE/dW = x^T @ dE/dz? dE/dW = dE/dz * dz/dW = dE/dz * d(xW + b)/dW. The d(xW + b)/dW term contributes a factor of x, and lining up shapes in the row-vector convention (2x1 @ 1x2 → 2x2, the shape of W) gives dE/dW = x^T @ dE/dz.
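A quick finite-difference check of both identities (a sketch reusing x, y_true, W1, b1, W2, b2, and sigmoid from the snippets above; numerical_grad is a hypothetical helper of mine, not from NN.py):

```python
def numerical_grad(f, param, eps=1e-6):
    # Central-difference estimate of d f / d param, one entry at a time.
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps; up = f()
        param[idx] = old - eps; down = f()
        param[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

def loss():
    # 0.5 * squared error, so that dE/dz2 = z2 - y_true exactly.
    out = sigmoid(x @ W1 + b1) @ W2 + b2
    return 0.5 * np.sum((out - y_true) ** 2)

# Analytic gradients from the identities above, compared against the numerical ones.
h = sigmoid(x @ W1 + b1)
dE_dz2 = (h @ W2 + b2) - y_true
print(np.allclose(numerical_grad(loss, W2), h.T @ dE_dz2))   # dE/dW = x^T @ dE/dz
print(np.allclose(numerical_grad(loss, b2), dE_dz2))         # dE/db = dE/dz
```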
Inspired by Math for Deep Learning by Ronald T. Kneusel. Git Repo