Backprop Explainer

[Figure: preview of the Backprop Explainer]


Introduction



Backpropagation is one of the most important concepts in neural networks, yet it is challenging for learners because it is the most notation-heavy part. Luckily, when the layers of notation are peeled back, the simplicity of backprop is revealed: backprop is just a very simple process that tells us which parameters to change in a neural network.

For the remainder of the article, the aim is to build an understanding of the foundations by marrying explanation, notation, and interactive tools. Note that throughout the article there will be highlighted words that give extra explanation on mouse over.

Backprop on a Linear Problem


Getting Started

The goal in a neural network, or any optimization problem for that matter, is to minimize whatever loss function we define. For this article, we will be performing regression and using mean squared error (MSE) loss.
Before hitting the calculus, tune the weight and bias of one neuron by dragging the sliders. By observing changes in the loss, try to make it reach 0. When you feel like you've lowered the loss enough (or need some help), press the button below.

[Interactive widget: manually tune the weight and bias with sliders and try to reach a loss of 0]

\text{neuron}(x) = -0.82x + 0.02

\text{loss} = \frac{1}{J}\sum_{i = 1}^J (\hat{y_i} - y_i)^2 = 0.4353650
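To make this concrete in code, here is a minimal sketch of the neuron and the MSE loss in Python (the data points below are made up for illustration and are not the widget's hidden data set; the Explainer itself runs on tensorflow.js):

```python
def neuron(x, w, b):
    """One neuron with a single weight and bias: a line."""
    return w * x + b

def mse_loss(y_hat, y):
    """Mean squared error over a batch of predictions."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

# Score one setting of (w, b) against a tiny, made-up data set.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-0.9, -0.4, 0.1, 0.6, 1.1]
print(mse_loss([neuron(x, w=-0.82, b=0.02) for x in xs], ys))
```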


Reflection

The intuition and logic from the exercise above is the foundation for what backpropagation and optimization aim to achieve.

When you start to change the weight, you observe where the loss moves; if it moves up, you move the weight in the other direction to lower the loss. This, without the math, is the main principle behind gauging the rate of change and optimizing a neural network. Let's apply this method of thinking!

Defining Backpropagation

First, to get the instantaneous rate of change (the derivative), we take our point of interest and another point that is infinitely close, and calculate the slope between them. For a function of one variable like f(x), this can be visualized as a tangent line at the point of interest.

In the context of our one-neuron neural network, we can compose the whole network as a nested function
\text{loss}(\text{neuron}(x,w,b),y)
Since we want to tune the parameters, weight w and bias b, we want to know how each parameter will affect the loss; in other words, we want the partial derivative of the loss with respect to each parameter, together called the gradient. By computing the gradient of the loss, we have effectively gauged how changing each parameter will affect the loss: it points in the direction of steepest ascent. But we want to lower the loss, so we use the opposite direction, the direction of steepest descent, to perform gradient descent.

This step can be visualized by graphing a loss function. Let's graph the mean squared error loss with the same data that will show up later (stay tuned). The contour plot below is the result.
[Figure: contour plot of the loss over weight and bias]
This plot shows the different losses represented by colors, descending from the greens, to the blues, to the purples, down to a loss of 0 in white. The x axis represents the weight and the y axis represents the bias. For example, at a weight of -1 and a bias of 5, the point (-1, 5) on the plot, the loss is 0.0 (white on the contour); that is the place we eventually want to get to. You could also think of the plot as a physical hole where the colors represent depth. You might already see how we could take small steps down the hole until we reach the bottom, where the loss is minimized.

The idea of gradient descent for optimization is illustrated below, starting with step 1.

[Figure: gradient descent explained step by step]
And that's all there is to gradient descent! 1) Start with a point, 2) get the steepest ascent, 3) flip it to get the steepest descent, then 4) take a step in that direction. After doing this enough times, we will reach the minimum loss possible.
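Those four steps map directly onto a short loop. Below is a minimal sketch for our one-neuron model, using the analytic gradient of MSE; the starting point, learning rate, and step count are arbitrary illustrative choices:

```python
def gradient_descent(xs, ys, w=0.0, b=0.0, lr=0.1, steps=100):
    """Fit one neuron (w * x + b) to data by gradient descent on MSE."""
    n = len(xs)
    for _ in range(steps):
        # 1) current point (w, b): forward pass to get predictions
        preds = [w * x + b for x in xs]
        # 2) steepest ascent: gradient of MSE with respect to w and b
        dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # 3) + 4) flip the sign and take a step in that direction
        w -= lr * dw
        b -= lr * db
    return w, b
```

Each pass through the loop is one small step down the hole from the contour plot above.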

The question then becomes: how can we calculate the gradient? To answer that, we need a bit of calculus to calculate all the derivatives of the loss. Mainly, you will need the chain rule from calculus, because of the nested nature of the function we've composed.


Below is a color-coded example of the chain rule. Start by sliding the slider and notice how each output is the input to the next function, and so forth. Then read the explanation below.

[Interactive widget: a slider feeding a chain of three neurons]
neuron1(34.000) = 0.42(34.000) + 0.75 = 15.030
neuron2(15.030) = 0.3(15.030) + 0.27 = 4.779
neuron3(4.779) = 0.06(4.779) + 0.29 = 0.577

Suppose we wanted to see how blue (the final output, 0.577) was affected by pink (the input, 34.000).
First let's start at blue, then
  1. observe how blue was affected by orange (4.779)
  2. observe how orange was affected by green (15.030)
  3. observe how green was affected by pink (34.000)
By chaining these observations together, we get how blue was affected by pink.
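Because every function in this chain is linear, each of these "observations" is just the corresponding neuron's weight, and chaining them means multiplying them together. A quick sketch to verify this numerically (the finite-difference check is our addition, not part of the original widget):

```python
# Chain rule on the three nested neurons from the example above.
# Each neuron is linear, so its derivative w.r.t. its input is its weight.
n1 = lambda x: 0.42 * x + 0.75   # pink -> green
n2 = lambda x: 0.30 * x + 0.27   # green -> orange
n3 = lambda x: 0.06 * x + 0.29   # orange -> blue

# How blue is affected by pink: chain (multiply) the three local derivatives.
d_blue_d_pink = 0.06 * 0.30 * 0.42   # = 0.00756

# Sanity check with a finite difference around x = 34.0:
x, eps = 34.0, 1e-6
numeric = (n3(n2(n1(x + eps))) - n3(n2(n1(x)))) / eps
print(d_blue_d_pink, numeric)   # both ~0.00756
```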

This logic applied to our one-neuron neural network looks like
\frac{\partial \text{loss}}{\partial w} = \frac{\partial \text{loss}}{\partial \text{neuron}} \frac{\partial \text{neuron}}{\partial w}
\frac{\partial \text{loss}}{\partial b} = \frac{\partial \text{loss}}{\partial \text{neuron}} \frac{\partial \text{neuron}}{\partial b}
These chains can be broken up into more intermediate derivatives, all the way down to their primitives (the basis of automatic differentiation). The main takeaway is that we first observe how the loss output was affected by the neuron output, \frac{\partial \text{loss}}{\partial \text{neuron}}, then we observe how the neuron output was affected by each parameter, \frac{\partial \text{neuron}}{\partial \text{parameter}}, and chain these to observe how the loss was affected by the parameter, \frac{\partial \text{loss}}{\partial \text{parameter}}. Notice that we compute these derivatives going backward (which is where the term backpropagation comes from), with the added benefit of reusing values computed in the forward propagation (for more on this, check out reverse mode automatic differentiation).
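To make the pieces concrete: for our neuron \hat{y} = \text{neuron}(x,w,b) = wx + b with squared error loss (\hat{y} - y)^2 on a single training example, the intermediate derivatives work out to

\frac{\partial \text{loss}}{\partial \hat{y}} = 2(\hat{y} - y), \quad \frac{\partial \hat{y}}{\partial w} = x, \quad \frac{\partial \hat{y}}{\partial b} = 1

so chaining gives \frac{\partial \text{loss}}{\partial w} = 2(\hat{y} - y) \cdot x and \frac{\partial \text{loss}}{\partial b} = 2(\hat{y} - y). We will use exactly these in the concrete example next.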

Concrete Example

Let's go through a concrete example of forward propagation and then, with extra emphasis, backward propagation. The training example will be (x = 2.1, y = 4), the weight will be w = 1, and the bias will be b = 0. \hat{y} represents the neuron output and predicted value, and \text{loss} represents squared error loss.

Forward Overview

Forward Computation
[Figure: forward computation graph]
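In equations, the forward pass works out to

\hat{y} = wx + b = (1)(2.1) + (0) = 2.1

\text{loss} = (\hat{y} - y)^2 = (2.1 - 4)^2 = 3.61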
Now we can go backward and compute partial derivatives with the chain rule to get the gradient \nabla \text{loss}.

Backward Overview

Backward Computation
[Figure: backward computation graph]
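In equations, moving backward from the loss using the chain rule:

\frac{\partial \text{loss}}{\partial \hat{y}} = 2(\hat{y} - y) = 2(2.1 - 4) = -3.8

\frac{\partial \hat{y}}{\partial w} = x = 2.1, \quad \frac{\partial \hat{y}}{\partial b} = 1

Multiplying the pieces together gives the gradient: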
\frac{\partial \text{loss}}{\partial w} = -7.98
\frac{\partial \text{loss}}{\partial b} = -3.8
Then, we update the parameters by stepping opposite the gradient to descend the loss; in this case, the learning rate is \text{lr} = 0.01.
w := w - \text{lr} \cdot \frac{\partial \text{loss}}{\partial w} = (1) - (0.01)(-7.98) = 1.0798
b := b - \text{lr} \cdot \frac{\partial \text{loss}}{\partial b} = (0) - (0.01)(-3.8) = 0.038
To see how well our tuned parameters do, let's do one more forward pass
\text{loss} = (((1.0798)(2.1) + 0.038) - 4)^2 = 2.87
Total loss decrease of 3.61 - 2.87 = 0.74. The loss went down!
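Every number above can be reproduced in a few lines of Python (a sketch; the variable names are ours, not from the Explainer's source):

```python
# One full training step on the single example (x = 2.1, y = 4).
x, y = 2.1, 4.0
w, b = 1.0, 0.0
lr = 0.01

# Forward pass
y_hat = w * x + b                 # 2.1
loss = (y_hat - y) ** 2           # 3.61

# Backward pass (chain rule)
dloss_dyhat = 2 * (y_hat - y)     # -3.8
dloss_dw = dloss_dyhat * x        # -7.98
dloss_db = dloss_dyhat * 1        # -3.8

# Update: step opposite the gradient
w -= lr * dloss_dw                # 1.0798
b -= lr * dloss_db                # 0.038

# One more forward pass to check
print((w * x + b - y) ** 2)       # ~2.87
```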

See it in Action

To see what we just did (forward, backward, update) on more data and on the entire batch, as opposed to a single training example, press the button to start the training process. Watch how the loss decreases as the line fits better.
[Interactive widget: a single neuron training on a batch of data, showing the best-fit line, the current equation \text{neuron}(x) = 0.00x + 0.00, the epoch counter, and a loss graph]

Backprop on a Non-Linear Problem


The Changes

To fit more interesting data that is non-linear (e.g. a sine wave or a quadratic), we need to add complexity so that we are not constrained to only linear outputs. We can do this by adding more neurons per layer, adding more layers, and adding non-linearities (activation functions) to the outputs. If you think of our entire neural network as a function, then by adding more neurons and more layers we are creating a more deeply nested function. Not only does this create more parameters that we can tune to vary the output, it also maintains the property of differentiability, which is important so we can compute the gradient. And by adding non-linear activation functions with points of deactivation, certain neurons may have no effect on the output while others become more activated, contributing to outputs that don't have to follow linear constraints. We will be using the ReLU activation function in the hidden layers.
[Figure: a neural network with one input, three hidden layers of eight neurons each, and one output neuron]
Above is an example of a neural network with one input, three hidden layers with eight neurons each, and one output neuron. The output of each neuron is fed into the neurons of the next layer and so forth (like a nested function). Each link represents a weight and a corresponding input into the respective neuron: notice how the more neurons we add, the more connections there are and the more parameters we can tune to get our desired output.

Training Process

  1. Forward propagation resulting in an output and loss
  2. Backward propagation using the chain rule to compute the gradient
  3. Descend the loss by performing gradient descent
The process doesn't change from the single-neuron example! Since the network is deeper, we have to calculate more derivatives going backward and tune more parameters with gradient descent, but the logic stays the same.
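As a sketch of that same three-step loop on a deeper network, here is a one-hidden-layer ReLU network trained on a sine wave with NumPy. The layer size, data, learning rate, and epoch count are arbitrary choices for illustration, not the Explainer's exact configuration (which runs on tensorflow.js):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny data set: a sine wave (illustrative choice)
X = np.linspace(-np.pi, np.pi, 64).reshape(-1, 1)
Y = np.sin(X)

# One hidden ReLU layer of 8 neurons (arbitrary size), one linear output
W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.01

for epoch in range(2000):
    # 1) Forward propagation -> output and loss
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)          # ReLU activation
    y_hat = a1 @ W2 + b2
    loss = np.mean((y_hat - Y) ** 2)

    # 2) Backward propagation: chain rule, layer by layer
    d_yhat = 2 * (y_hat - Y) / len(X)
    dW2 = a1.T @ d_yhat; db2 = d_yhat.sum(0)
    d_a1 = d_yhat @ W2.T
    d_z1 = d_a1 * (z1 > 0)            # deactivated ReLU neurons pass no gradient
    dW1 = X.T @ d_z1; db1 = d_z1.sum(0)

    # 3) Gradient descent: step opposite each gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # much smaller than at epoch 0
```

Note how the backward pass just applies the chain rule layer by layer, reusing the values z1 and a1 computed in the forward pass, exactly as described earlier.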

A great way to visualize backpropagation in a large network is with vertical arrows representing the direction we need to nudge each neuron's output in order to lower the loss: -\frac{\partial \text{loss}}{\partial \text{activation}}. In the neural network below, you will be able to visualize all phases of a single epoch, with an emphasis on backpropagation. Read the instructions below to get a quick start!

Backprop Explainer Quick Start

  1. Press the button to start training
  2. Then press the button to see the forward propagation, backward propagation, and update animation at a given epoch
  3. To go back to fitting mode, click the button
Click on the icons to reveal extra descriptions

[Figure: key for the Backprop Explainer visualization]

Backprop Explainer

[Interactive widget: the Backprop Explainer, a neural network fitting a chosen data set (sin, cos, or tanh), with a loss graph, an epoch counter, and a Control Center for customizing the learning rate (0.0001, 0.001, 0.003, or 0.005), the data set, and the layers]

Conclusion


By building up knowledge starting from one neuron (the linear example) to multiple neurons with activation functions (the non-linear example), it becomes apparent that backpropagation is just our way of deciding which parameters need to be updated. Unsurprisingly, this is exactly what we need when we want to tune the parameters to lower the loss.

Thankfully, backpropagation is not limited to these small examples; as far as neural networks reach, backpropagation will follow. Nowadays, no matter the problem, if there is data to learn from, someone will apply a neural network to it (for better or for worse).

Most incredibly, through a simple backpropagation algorithm, learning becomes possible.

Acknowledgements



TensorFlow Playground by Daniel Smilkov and Shan Carter

  • Inspiration for the representation of the neural network
  • Adapted d3 code for the axis scaling of the loss graph and the best-fit graphs
  • Adapted css code for the links keyframe animation


What is backpropagation really doing? by 3Blue1Brown

  • Inspiration to use arrows to represent where to nudge outputs to lower loss
  • Inspiration for weights color scheme


CNN Explainer by Jay Wang, Robert Turko, Omar Shaikh, Haekyu Park, Nilaksh Das, Fred Hohman, Minsuk Kahng, and Polo Chau

  • Model for what good animations look like
  • Inspiration for the article format with components
  • Inspiration for the name backprop explainer


Communicating with Interactive Articles by Fred Hohman, Matthew Conlen, Jeffrey Heer, and Polo Chau

  • Used colored labeling to toggle labels for notation


Who made this?

Created by Donald Bertucci and Minsuk Kahng.

Donald Bertucci is a freshman at Oregon State University. This work was done as part of the URSA Engage Program, an undergraduate research program for first and second year students, with the advice of Prof. Minsuk Kahng.

How was this made?

Made with d3.js and react.js for interactive visualization components, tensorflow.js for neural network training and fast math operations, and KaTeX to render LaTeX math equations.

Found any errors?

Please create an issue on the GitHub repository if you find an error in the article, any of the components, or the Backprop Explainer.