Backprop Explainer

[Figure: preview of the Backprop Explainer]


Introduction



Backpropagation is one of the most important concepts in neural networks, yet it is challenging for learners because it is the most notation-heavy part. Luckily, when the layers of notation are peeled back, the simplicity of backprop is revealed: backprop is just a very simple process that tells us which parameters to change in a neural network.

For the remainder of the article, the aim is to build an understanding of the foundations by marrying explanation, notation, and interactive tools. Note that throughout the article there will be highlighted words that give extra explanation on mouse over.

Backprop on a Linear Problem


Getting Started

The goal in a neural network, or any optimization problem for that matter, is to minimize whatever loss function we define. For this article, we will be performing regression and using mean squared error (MSE) loss.
Before hitting the calculus, tune the weight and bias of one neuron by dragging the sliders. By observing changes in the loss, try to make it reach 0. When you feel like you've lowered the loss enough (or need some help), press the button below.

[Interactive widget: manually tune the weight and bias with sliders and try to reach a loss of 0]

\text{neuron}(x) = -0.82x + 0.02

\text{loss} = \frac{1}{J}\sum_{i = 1}^J (\hat{y_i} - y_i)^2 = 0.4353650
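To make this concrete in code, here is a minimal sketch of the neuron and the MSE loss in Python (the data points below are made up for illustration and are not the widget's hidden data set; the Explainer itself runs on tensorflow.js):

```python
def neuron(x, w, b):
    """One neuron with a single weight and bias: a line."""
    return w * x + b

def mse_loss(y_hat, y):
    """Mean squared error over a batch of predictions."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

# Score one setting of (w, b) against a tiny, made-up data set.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-0.9, -0.4, 0.1, 0.6, 1.1]
print(mse_loss([neuron(x, w=-0.82, b=0.02) for x in xs], ys))
```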


Reflection

The intuition and logic from the exercise above is the foundation for what backpropagation and optimization aim to achieve.

When you start to change the weight, you observe where the loss moves; if it moves up, you move the weight in the other direction to lower the loss. This, without the math, is the main principle behind gauging the rate of change and optimizing a neural network. Let's apply this method of thinking!

Defining Backpropagation

First, to get the instantaneous rate of change (the derivative), we take our point of interest and another point that is infinitely close, and calculate the slope between them. For a function of one variable like f(x), this can be visualized as a tangent line at the point of interest.

In the context of our one-neuron neural network, we can compose the whole network as a nested function
\text{loss}(\text{neuron}(x,w,b),y)
Since we want to tune the parameters, weight w and bias b, we want to know how each parameter will affect the loss; in other words, we want the partial derivative of the loss with respect to each parameter, together called the gradient. By computing the gradient of the loss, we have effectively gauged how changing each parameter will affect the loss: it points in the direction of steepest ascent. But we want to lower the loss, so we use the opposite direction, the direction of steepest descent, to perform gradient descent.

This step can be visualized by graphing a loss function. Let's graph the mean squared error loss with the same data that will show up later (stay tuned). The contour plot below is the result.
[Figure: contour plot of the loss over weight and bias]
This plot shows the different losses represented by colors, descending from the greens, to the blues, to the purples, down to a loss of 0 in white. The x axis represents the weight and the y axis represents the bias. For example, at a weight of -1 and a bias of 5, the point (-1, 5) on the plot, the loss is 0.0 (white on the contour); that is the place we eventually want to get to. You could also think of the plot as a physical hole where the colors represent depth. You might already see how we could take small steps down the hole until we reach the bottom, where the loss is minimized.

The idea of gradient descent for optimization is illustrated below, starting with step 1.

[Figure: gradient descent explained step by step]
And that's all there is to gradient descent! 1) Start with a point, 2) get the steepest ascent, 3) flip it to get the steepest descent, then 4) take a step in that direction. After doing this enough times, we will reach the minimum loss possible.
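Those four steps map directly onto a short loop. Below is a minimal sketch for our one-neuron model, using the analytic gradient of MSE; the starting point, learning rate, and step count are arbitrary illustrative choices:

```python
def gradient_descent(xs, ys, w=0.0, b=0.0, lr=0.1, steps=100):
    """Fit one neuron (w * x + b) to data by gradient descent on MSE."""
    n = len(xs)
    for _ in range(steps):
        # 1) current point (w, b): forward pass to get predictions
        preds = [w * x + b for x in xs]
        # 2) steepest ascent: gradient of MSE with respect to w and b
        dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # 3) + 4) flip the sign and take a step in that direction
        w -= lr * dw
        b -= lr * db
    return w, b
```

Each pass through the loop is one small step down the hole from the contour plot above.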

The question then becomes: how can we calculate the gradient? To answer that, we need a bit of calculus to calculate all the derivatives of the loss. Mainly, you will need the chain rule from calculus, because of the nested nature of the function we've composed.


Below is a color-coded example of the chain rule. Start by sliding the slider and notice how each output is the input to the next function, and so forth. Then read the explanation below.

[Interactive widget: a slider feeding a chain of three neurons]
neuron1(34.000) = 0.42(34.000) + 0.75 = 15.030
neuron2(15.030) = 0.3(15.030) + 0.27 = 4.779
neuron3(4.779) = 0.06(4.779) + 0.29 = 0.577

Suppose we wanted to see how blue (the final output, 0.577) was affected by pink (the input, 34.000).
First let's start at blue, then
  1. observe how blue was affected by orange (4.779)
  2. observe how orange was affected by green (15.030)
  3. observe how green was affected by pink (34.000)
By chaining these observations together, we get how blue was affected by pink.
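Because every function in this chain is linear, each of these "observations" is just the corresponding neuron's weight, and chaining them means multiplying them together. A quick sketch to verify this numerically (the finite-difference check is our addition, not part of the original widget):

```python
# Chain rule on the three nested neurons from the example above.
# Each neuron is linear, so its derivative w.r.t. its input is its weight.
n1 = lambda x: 0.42 * x + 0.75   # pink -> green
n2 = lambda x: 0.30 * x + 0.27   # green -> orange
n3 = lambda x: 0.06 * x + 0.29   # orange -> blue

# How blue is affected by pink: chain (multiply) the three local derivatives.
d_blue_d_pink = 0.06 * 0.30 * 0.42   # = 0.00756

# Sanity check with a finite difference around x = 34.0:
x, eps = 34.0, 1e-6
numeric = (n3(n2(n1(x + eps))) - n3(n2(n1(x)))) / eps
print(d_blue_d_pink, numeric)   # both ~0.00756
```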

This logic applied to our one-neuron neural network looks like
\frac{\partial \text{loss}}{\partial w} = \frac{\partial \text{loss}}{\partial \text{neuron}} \frac{\partial \text{neuron}}{\partial w}
\frac{\partial \text{loss}}{\partial b} = \frac{\partial \text{loss}}{\partial \text{neuron}} \frac{\partial \text{neuron}}{\partial b}
These chains can be broken up into more intermediate derivatives, all the way down to their primitives (the basis of automatic differentiation). The main takeaway is that we first observe how the loss output was affected by the neuron output, \frac{\partial \text{loss}}{\partial \text{neuron}}, then we observe how the neuron output was affected by each parameter, \frac{\partial \text{neuron}}{\partial \text{parameter}}, and chain these to observe how the loss was affected by the parameter, \frac{\partial \text{loss}}{\partial \text{parameter}}. Notice that we compute these derivatives going backward (which is where the term backpropagation comes from), with the added benefit of reusing values computed in the forward propagation (for more on this, check out reverse mode automatic differentiation).
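To make the pieces concrete: for our neuron \hat{y} = \text{neuron}(x,w,b) = wx + b with squared error loss (\hat{y} - y)^2 on a single training example, the intermediate derivatives work out to

\frac{\partial \text{loss}}{\partial \hat{y}} = 2(\hat{y} - y), \quad \frac{\partial \hat{y}}{\partial w} = x, \quad \frac{\partial \hat{y}}{\partial b} = 1

so chaining gives \frac{\partial \text{loss}}{\partial w} = 2(\hat{y} - y) \cdot x and \frac{\partial \text{loss}}{\partial b} = 2(\hat{y} - y). We will use exactly these in the concrete example next.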

Concrete Example

Let's go through a concrete example of forward propagation and then, with extra emphasis, backward propagation. The training example will be (x = 2.1, y = 4), the weight will be w = 1, and the bias will be b = 0. \hat{y} represents the neuron output and predicted value, and \text{loss} represents squared error loss.

Forward Overview

Forward Computation
[Figure: forward computation graph]
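In equations, the forward pass works out to

\hat{y} = wx + b = (1)(2.1) + (0) = 2.1

\text{loss} = (\hat{y} - y)^2 = (2.1 - 4)^2 = 3.61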
Now we can go backward and compute partial derivatives with the chain rule to get the gradient \nabla \text{loss}.

Backward Overview

Backward Computation
[Figure: backward computation graph]
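In equations, moving backward from the loss using the chain rule:

\frac{\partial \text{loss}}{\partial \hat{y}} = 2(\hat{y} - y) = 2(2.1 - 4) = -3.8

\frac{\partial \hat{y}}{\partial w} = x = 2.1, \quad \frac{\partial \hat{y}}{\partial b} = 1

Multiplying the pieces together gives the gradient: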
\frac{\partial \text{loss}}{\partial w} = -7.98
\frac{\partial \text{loss}}{\partial b} = -3.8
Then, we update the parameters by stepping opposite the gradient to descend the loss; in this case, the learning rate is \text{lr} = 0.01.
w := w - \text{lr} \cdot \frac{\partial \text{loss}}{\partial w} = (1) - (0.01)(-7.98) = 1.0798
b := b - \text{lr} \cdot \frac{\partial \text{loss}}{\partial b} = (0) - (0.01)(-3.8) = 0.038
To see how well our tuned parameters do, let's do one more forward pass
\text{loss} = (((1.0798)(2.1) + 0.038) - 4)^2 = 2.87
Total loss decrease of 3.61 - 2.87 = 0.74. The loss went down!
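Every number above can be reproduced in a few lines of Python (a sketch; the variable names are ours, not from the Explainer's source):

```python
# One full training step on the single example (x = 2.1, y = 4).
x, y = 2.1, 4.0
w, b = 1.0, 0.0
lr = 0.01

# Forward pass
y_hat = w * x + b                 # 2.1
loss = (y_hat - y) ** 2           # 3.61

# Backward pass (chain rule)
dloss_dyhat = 2 * (y_hat - y)     # -3.8
dloss_dw = dloss_dyhat * x        # -7.98
dloss_db = dloss_dyhat * 1        # -3.8

# Update: step opposite the gradient
w -= lr * dloss_dw                # 1.0798
b -= lr * dloss_db                # 0.038

# One more forward pass to check
print((w * x + b - y) ** 2)       # ~2.87
```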

See it in Action

To see what we just did (forward, backward, update) on more data and on the entire batch, as opposed to a single training example, press the button to start the training process. Watch how the loss decreases as the line fits better.
[Interactive widget: a single neuron training on a batch of data, showing the best-fit line, the current equation \text{neuron}(x) = 0.00x + 0.00, the epoch counter, and a loss graph]

Backprop on a Non-Linear Problem


The Changes

To fit more interesting data that is non-linear (e.g. a sine wave or a quadratic), we need to add complexity so that we are not constrained to only linear outputs. We can do this by adding more neurons per layer, adding more layers, and adding non-linearities (activation functions) to the outputs. If you think of our entire neural network as a function, then by adding more neurons and more layers we are creating a more deeply nested function. Not only does this create more parameters that we can tune to vary the output, it also maintains the property of differentiability, which is important so we can compute the gradient. And by adding non-linear activation functions with points of deactivation, certain neurons may have no effect on the output while others become more activated, contributing to outputs that don't have to follow linear constraints. We will be using the ReLU activation function in the hidden layers.
[Figure: a neural network with one input, three hidden layers of eight neurons each, and one output neuron]
Above is an example of a neural network with one input, three hidden layers with eight neurons each, and one output neuron. The output of each neuron is fed into the neurons of the next layer and so forth (like a nested function). Each link represents a weight and a corresponding input into the respective neuron: notice how the more neurons we add, the more connections there are and the more parameters we can tune to get our desired output.

Training Process

  1. Forward propagation resulting in an output and loss
  2. Backward propagation using the chain rule to compute the gradient
  3. Descend the loss by performing gradient descent
The process doesn't change from the single-neuron example! Since the network is deeper, we have to calculate more derivatives going backward and tune more parameters with gradient descent, but the logic stays the same.
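As a sketch of that same three-step loop on a deeper network, here is a one-hidden-layer ReLU network trained on a sine wave with NumPy. The layer size, data, learning rate, and epoch count are arbitrary choices for illustration, not the Explainer's exact configuration (which runs on tensorflow.js):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny data set: a sine wave (illustrative choice)
X = np.linspace(-np.pi, np.pi, 64).reshape(-1, 1)
Y = np.sin(X)

# One hidden ReLU layer of 8 neurons (arbitrary size), one linear output
W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.01

for epoch in range(2000):
    # 1) Forward propagation -> output and loss
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)          # ReLU activation
    y_hat = a1 @ W2 + b2
    loss = np.mean((y_hat - Y) ** 2)

    # 2) Backward propagation: chain rule, layer by layer
    d_yhat = 2 * (y_hat - Y) / len(X)
    dW2 = a1.T @ d_yhat; db2 = d_yhat.sum(0)
    d_a1 = d_yhat @ W2.T
    d_z1 = d_a1 * (z1 > 0)            # deactivated ReLU neurons pass no gradient
    dW1 = X.T @ d_z1; db1 = d_z1.sum(0)

    # 3) Gradient descent: step opposite each gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # much smaller than at epoch 0
```

Note how the backward pass just applies the chain rule layer by layer, reusing the values z1 and a1 computed in the forward pass, exactly as described earlier.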

A great way to visualize backpropagation in a large network is with vertical arrows representing the direction we need to nudge each neuron's output in order to lower the loss: -\frac{\partial \text{loss}}{\partial \text{activation}}. In the neural network below, you will be able to visualize all phases of a single epoch, with an emphasis on backpropagation. Read the instructions below to get a quick start!

Backprop Explainer Quick Start

  1. Press the button to start training
  2. Then press the button to see the forward propagation, backward propagation, and update animation at a given epoch
  3. To go back to fitting mode, click the button
Click on the icons to reveal extra descriptions

[Figure: key for the Backprop Explainer visualization]

Backprop Explainer

[Interactive widget: the Backprop Explainer, a neural network fitting a chosen data set (sin, cos, or tanh), with a loss graph, an epoch counter, and a Control Center for customizing the learning rate (0.0001, 0.001, 0.003, or 0.005), the data set, and the layers]

Conclusion


By building up knowledge starting from one neuron (the linear example) to multiple neurons with activation functions (the non-linear example), it becomes apparent that backpropagation is just our way of deciding which parameters need to be updated. Unsurprisingly, this is exactly what we need when we want to tune the parameters to lower the loss.

Thankfully, backpropagation is not limited to these small examples; as far as neural networks reach, backpropagation will follow. Nowadays, no matter the problem, if there is data to learn from, someone will apply a neural network to it (for better or for worse).

Most incredibly, through a simple backpropagation algorithm, learning becomes possible.

Acknowledgements



TensorFlow Playground by Daniel Smilkov and Shan Carter

  • Inspiration for the representation of the neural network
  • Adapted d3 code for the axis scaling of the loss graph and the best-fit graphs
  • Adapted css code for the links keyframe animation


What is backpropagation really doing? by 3Blue1Brown

  • Inspiration to use arrows to represent where to nudge outputs to lower loss
  • Inspiration for weights color scheme


CNN Explainer by Jay Wang, Robert Turko, Omar Shaikh, Haekyu Park, Nilaksh Das, Fred Hohman, Minsuk Kahng, and Polo Chau

  • Model for what good animations look like
  • Inspiration for the article format with components
  • Inspiration for the name backprop explainer


Communicating with Interactive Articles by Fred Hohman, Matthew Conlen, Jeffrey Heer, and Polo Chau

  • Used colored labeling to toggle labels for notation


Who made this?

Created by Donald Bertucci and Minsuk Kahng.

Donald Bertucci is a freshman at Oregon State University. This work was done as part of the URSA Engage Program, an undergraduate research program for first and second year students, with the advice of Prof. Minsuk Kahng.

How was this made?

Made with d3.js and react.js for interactive visualization components, tensorflow.js for neural network training and fast math operations, and KaTeX to render LaTeX math equations.

Found any errors?

Please create an issue on the GitHub repository if you find an error in the article, any of the components, or the Backprop Explainer.