Tiny Toy Network: Weights, Biases, and Graph


Layer 1: FullyConnected(2 → 2)

W1:
        h1      h2
x1      0.10   -0.20
x2      0.40    0.20

b1:
        h1      h2
bias    0.00    0.00

W1, b1 map x → z1 → h (sigmoid)

Layer 2: FullyConnected(2 → 2)

W2:
        y1      y2
h1      0.30   -0.10
h2      0.20    0.20

b2:
        y1      y2
bias    0.00    0.00

W2, b2 map h → z2 (linear output)
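A minimal NumPy sketch of this setup (my own code, not Kneusel's NN.py; the sample input x below is an arbitrary illustrative choice):

```python
import numpy as np

# Weights and biases from the tables above, row-vector convention: x @ W + b
W1 = np.array([[0.10, -0.20],   # row x1 -> columns h1, h2
               [0.40,  0.20]])  # row x2
b1 = np.array([[0.00, 0.00]])

W2 = np.array([[0.30, -0.10],   # row h1 -> columns y1, y2
               [0.20,  0.20]])  # row h2
b2 = np.array([[0.00, 0.00]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass for one sample
x  = np.array([[1.0, 0.0]])     # shape (1, 2); illustrative input
z1 = x @ W1 + b1                # hidden pre-activation
h  = sigmoid(z1)                # hidden activation
z2 = h @ W2 + b2                # linear output, shape (1, 2)
```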

Computational Graph

[Diagram: computational graph — input nodes x1, x2; sigmoid hidden nodes h1, h2; output nodes y1, y2; four bias nodes, all 0.00.]

Edges are labeled with weights; biases are shown as separate nodes feeding each neuron.

dE/dz2 = mse_prime(y_true, z2) = z2 - y_true
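For example, in NumPy (a sketch continuing the code above; the 1/n and factor-of-2 scaling of MSE is dropped here so that dE/dz2 = z2 - y_true holds exactly, as in the line above):

```python
def mse(y_true, y_pred):
    # Squared error with a 1/2 factor so its derivative is y_pred - y_true
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def mse_prime(y_true, y_pred):
    # Error gradient with respect to the network output
    return y_pred - y_true

y_true = np.array([[1.0, 0.0]])   # target, as in the table below
dE_dz2 = mse_prime(y_true, z2)    # shape (1, 2)
```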

The table below also shows the y_true and z2 values for each sample.

sample   x1     x2     z1       h (activation)   y_true (target)   z2 (output)           dE/dz2
0        0.00   0.00   [0, 0]   [0, 0]           [1, 0]            [0.28496, 0.13220]    [-0.71504, 0.13220]
1        0.00   0.00   [0, 0]   [0, 0]           [1, 0]            [0.30508, 0.13681]    [-0.69492, 0.13681]

sample   Layer 1 dE/dx (input error)   Layer 1 dE/dW (W1)   Layer 1 dE/db (b1)   Activation dE/dz1   Layer 2 dE/dh (input error)   Layer 2 dE/dW (W2)   Layer 2 dE/db (b2)
0        [0, 0]                        [[0, 0], [0, 0]]     [0, 0]               [0, 0]              [0, 0]                        [[0, 0], [0, 0]]     [0, 0]
1        [0, 0]                        [[0, 0], [0, 0]]     [0, 0]               [0, 0]              [0, 0]                        [[0, 0], [0, 0]]     [0, 0]

Backprop Cheat Sheet

Row-vector convention (1x2) matching the Python shapes.

z2 = h @ W2 + b2

dE/dz2 = z2 - y

dE/dW2 = h^T @ dE/dz2 (a 2x1 @ 1x2 product, giving the 2x2 weight gradient; Kneusel refers to this as weights_error in NN.py)

dE/dh = dE/dz2 @ W2^T (Kneusel refers to this as the input error in NN.py. The layer returns this error so it can be passed backward to the previous layer as the error on that layer's output.)

dE/dz1 = dE/dh * sigmoid'(z1), where the sigmoid derivative is sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

dE/dW1 = x^T * dE/dz1

dE/dx = dE/dz1 @ W1^T
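
Putting the cheat sheet together, continuing the NumPy sketch from above (variable names are mine, not necessarily those used in NN.py):

```python
def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Output layer (layer 2)
dE_dz2 = z2 - y_true                 # (1, 2), same as mse_prime above
dE_dW2 = h.T @ dE_dz2                # (2, 1) @ (1, 2) -> (2, 2), weights_error
dE_db2 = dE_dz2                      # same shape as b2
dE_dh  = dE_dz2 @ W2.T               # (1, 2) @ (2, 2) -> (1, 2), passed backward

# Activation layer
dE_dz1 = dE_dh * sigmoid_prime(z1)   # elementwise, (1, 2)

# Hidden layer (layer 1)
dE_dW1 = x.T @ dE_dz1                # (2, 1) @ (1, 2) -> (2, 2)
dE_db1 = dE_dz1
dE_dx  = dE_dz1 @ W1.T               # input error, returned to whatever comes before
```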

Passing errors through the network

To pass the error backward through the model, each layer uses how the error changes with its output to compute how the error changes with its input, and hands that result to the layer before it (see the sketch after the list below).

In this network, the order would be

  • dE/dx <- dE/dz1 <- dE/dh <- dE/dz2
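
A layer-style sketch of this chaining, reusing the arrays from the first code block. Each layer's backward step takes the error with respect to its output and returns the error with respect to its input; the class and method names here are illustrative, not necessarily those in Kneusel's NN.py:

```python
class FullyConnected:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, output_error):
        self.dW = self.x.T @ output_error   # dE/dW (weights_error)
        self.db = output_error              # dE/db
        return output_error @ self.W.T      # dE/d(input), handed to the previous layer

class Sigmoid:
    def forward(self, z):
        self.out = 1.0 / (1.0 + np.exp(-z))
        return self.out

    def backward(self, output_error):
        return output_error * self.out * (1.0 - self.out)

layers = [FullyConnected(W1, b1), Sigmoid(), FullyConnected(W2, b2)]

out = x
for layer in layers:                 # forward: x -> z1 -> h -> z2
    out = layer.forward(out)

err = out - y_true                   # dE/dz2
for layer in reversed(layers):       # backward: dE/dz2 -> dE/dh -> dE/dz1 -> dE/dx
    err = layer.backward(err)
```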

What about for W and b?

Recall that while the above tells us how to pass the error term backward, the goal of backprop is to calculate how changes to W and b affect the error.

Why is dE/db = dE/dz? Because dE/db = dE/dz * dz/db = dE/dz * d(xW + b)/db = dE/dz * (0 + 1) = dE/dz.

Why is dE/dW = x^T * dE/dz? Because dE/dW = dE/dz * dz/dW = dE/dz * d(xW + b)/dW, and the derivative of xW + b with respect to W brings in the input x; with the row-vector shapes this works out to the outer product x^T @ dE/dz (2x1 @ 1x2 = 2x2).
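
Both identities can be checked numerically with finite differences, continuing the earlier sketch (the 0.5 factor in the loss is assumed so that dE/dz2 = z2 - y_true holds exactly):

```python
eps = 1e-6

def loss(W2_, b2_):
    # Forward pass with perturbed layer-2 parameters
    z2_ = sigmoid(x @ W1 + b1) @ W2_ + b2_
    return 0.5 * np.sum((z2_ - y_true) ** 2)

# dE/db2 should equal dE/dz2
num_db2 = np.zeros_like(b2)
for j in range(b2.shape[1]):
    bp = b2.copy(); bp[0, j] += eps
    bm = b2.copy(); bm[0, j] -= eps
    num_db2[0, j] = (loss(W2, bp) - loss(W2, bm)) / (2 * eps)

# dE/dW2 should equal h^T @ dE/dz2
num_dW2 = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp = W2.copy(); Wp[i, j] += eps
        Wm = W2.copy(); Wm[i, j] -= eps
        num_dW2[i, j] = (loss(Wp, b2) - loss(Wm, b2)) / (2 * eps)

print(np.allclose(num_db2, z2 - y_true, atol=1e-6))          # True
print(np.allclose(num_dW2, h.T @ (z2 - y_true), atol=1e-6))  # True
```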

Inspired by Math for Deep Learning by Ronald T. Kneusel. Git Repo