Neural Networks Tutorial, Part #3

Posted on December 13, 2007
Filed Under Algorithms

In our previous tutorial we laid out the basic form of our two-layer feed-forward neural network (FFNN). In this installment we’ll derive a way of training it. Just to remind you, here is the basic outline of our neural network, along with all relevant variables:

A Two-Layer Feed Forward Neural Network

In general, when training a network, one prepares a set of inputs, x^\mu, and a corresponding set of desired outputs, z^\mu. The superscript \mu labels an input-output pair. Note that each x^\mu is a vector, with components x_i^\mu:

x^\mu = \left( \begin{array}{c} x_1^\mu \\ x_2^\mu \\ \vdots \\ x_N^\mu \end{array} \right)

In this example, N denotes the number of inputs.

In reality, however, when we feed the x^\mu's into our FFNN, the corresponding outputs \bar{s}^\mu often don’t match the desired outputs, z^\mu. We can quantify the total error from comparing the actual outputs to the desired outputs by defining an error function as follows:

E = \frac{1}{2} \sum_{\mu, k} \left( z_k^\mu - \bar{s}_k^\mu \right)^2

The double sum here ranges over all input-output pairs \mu, and over all output components k. We would like to minimize E: ideally make it zero, or at least as small as possible. We do this by employing a gradient descent method: at each iteration of our algorithm we compute the derivatives of E with respect to the bias vectors and weight matrices, and change each of them by a small amount proportional to the negative of the corresponding derivative. In other words:

\begin{align}\delta w_{ij} = - \epsilon \frac{ \partial E}{\partial w_{ij}} \\ \delta \bar{w}_{ij} = - \epsilon \frac{ \partial E}{\partial \bar{w}_{ij}}  \\ \delta b_i = - \epsilon \frac{ \partial E}{\partial b_i} \\ \delta \bar{b}_i = - \epsilon \frac{ \partial E}{\partial \bar{b}_i} \end{align}

where \epsilon is some small positive number, the choice of which is something of an art and takes some trial and error (more on that later). This method of adjusting our weights and biases is termed backpropagation, for a reason that will be made evident at the end of this post. Meanwhile, we need to get down and busy with computing the above derivatives. Be warned, things are going to get quite messy. For a firm understanding, I suggest following them on your own with pen and paper.
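To make the error function and the update rule concrete, here is a minimal Python sketch. The function names (total_error, gradient_step), the flat-list parameter layout and all the numbers are assumptions chosen for illustration. They are not part of the network definition from the previous installment.

```python
def total_error(outputs, targets):
    """E = 1/2 * sum over patterns mu and components k of (z - s_bar)^2."""
    return 0.5 * sum(
        (z_k - s_k) ** 2
        for s_bar, z in zip(outputs, targets)
        for s_k, z_k in zip(s_bar, z)
    )


def gradient_step(params, grads, epsilon):
    """Apply delta = -epsilon * dE/dparam to every parameter in a flat list."""
    return [p - epsilon * g for p, g in zip(params, grads)]


# Made-up data: two patterns (mu = 1, 2), two output components each.
outputs = [[0.8, 0.1], [0.4, 0.9]]   # actual outputs s_bar^mu
targets = [[1.0, 0.0], [0.0, 1.0]]   # desired outputs z^mu
E = total_error(outputs, targets)

# Made-up parameters and derivatives, just to show the direction of the step.
new_params = gradient_step([0.5, -0.3], [0.2, -0.4], epsilon=0.1)
```

Note that a parameter with a positive derivative is decreased and one with a negative derivative is increased, which is exactly what the minus sign in front of \epsilon does.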

Computing the Derivatives: Output Layer

Our main tool in this computation will be the chain rule, which I’ll assume you’re familiar with. Let us start with the “easier” derivatives first:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{k,\mu} \left( \bar{s}_k^\mu - z_k^\mu \right) \frac{\partial \bar{s}_k^\mu}{\partial \bar{w}_{ij}}

Since \bar{s}_k^\mu = \bar{f} (\bar{h}_k^\mu), we employ our chain rule as follows:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{k,\mu} \left( \bar{s}_k^\mu - z_k^\mu \right) \bar{f}'(\bar{h}_k^\mu) \frac{\partial \bar{h}_k^\mu}{\partial \bar{w}_{ij}}

Since \bar{h}_k^\mu = \sum_p \bar{w}_{kp} s_p^\mu + \bar{b}_k, we can immediately infer \frac{\partial \bar{h}_k^\mu}{\partial \bar{w}_{ij}} = \delta_{ki} s_j^\mu, where \delta_{ki} is the Kronecker delta, equal to 1 if k=i and 0 otherwise. Substituting and summing over k:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{k,\mu} \left( \bar{s}_k^\mu - z_k^\mu \right) \bar{f}'(\bar{h}_k^\mu) \delta_{ki} s_j^\mu = \sum_{\mu} \left( \bar{s}_i^\mu - z_i^\mu \right) \bar{f}'(\bar{h}_i^\mu) s_j^\mu

The quantities that appear in this equation are all directly computable by taking the input (for a particular \mu) and propagating it forward through our neural network. We’ll see how it’s actually done in an upcoming installment when we implement things in MATLAB, but for the time being, I’d just like to introduce a notation:

\bar{e}_k^\mu \equiv \left( \bar{s}_k^\mu - z_k^\mu \right)

This is the error in the output. Using this definition:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{\mu} \bar{e}_i^\mu \bar{f}'(\bar{h}_i^\mu) s_j^\mu

The computation of the bias is done similarly, and is even easier (you should try it yourself before reading further):

\frac{\partial E}{\partial \bar{b}_i} = \sum_{\mu, k} \bar{e}_k^\mu \bar{f}'(\bar{h}_k^\mu) \frac{\partial \bar{h}_k^\mu}{\partial \bar{b}_i} = \sum_\mu \bar{e}_i^\mu \bar{f}'(\bar{h}_i^\mu)

where we have used:

\frac{\partial \bar{h}_k^\mu}{\partial \bar{b}_i} = \delta_{ki}
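If you’d like to convince yourself that the output-layer formulas are right, a quick numerical check helps. The sketch below is only an illustration under stated assumptions: it takes \bar{f} to be the logistic sigmoid (the derivation itself does not fix the activation function), it treats the hidden-layer outputs s^\mu as given numbers, and every name and value in it is made up. It compares the analytic derivatives with a central finite difference of E.

```python
import math


def f_bar(h):
    # Assumed activation: logistic sigmoid (not fixed by the derivation).
    return 1.0 / (1.0 + math.exp(-h))


def f_bar_prime(h):
    return f_bar(h) * (1.0 - f_bar(h))


def output_layer(s, w_bar, b_bar):
    """Return (h_bar, s_bar) for one vector of hidden-layer outputs s."""
    h_bar = [sum(w * s_p for w, s_p in zip(row, s)) + b
             for row, b in zip(w_bar, b_bar)]
    return h_bar, [f_bar(h) for h in h_bar]


def error(S, Z, w_bar, b_bar):
    """E = 1/2 * sum over mu and k of (z - s_bar)^2."""
    total = 0.0
    for s, z in zip(S, Z):
        _, s_bar = output_layer(s, w_bar, b_bar)
        total += 0.5 * sum((z_k - s_k) ** 2 for s_k, z_k in zip(s_bar, z))
    return total


def output_gradients(S, Z, w_bar, b_bar):
    """Analytic dE/dw_bar and dE/db_bar from the formulas in the text."""
    grad_w = [[0.0] * len(row) for row in w_bar]
    grad_b = [0.0] * len(b_bar)
    for s, z in zip(S, Z):
        h_bar, s_bar = output_layer(s, w_bar, b_bar)
        for i in range(len(b_bar)):
            # e_bar_i * f_bar'(h_bar_i) for this pattern
            d = (s_bar[i] - z[i]) * f_bar_prime(h_bar[i])
            grad_b[i] += d
            for j in range(len(s)):
                grad_w[i][j] += d * s[j]
    return grad_w, grad_b


# Made-up numbers: two patterns, three hidden units, two outputs.
S = [[0.2, 0.7, 0.5], [0.9, 0.1, 0.4]]
Z = [[1.0, 0.0], [0.0, 1.0]]
w_bar = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
b_bar = [0.05, -0.1]

grad_w, grad_b = output_gradients(S, Z, w_bar, b_bar)

# Central finite differences for one weight and one bias.
step = 1e-6
w_plus = [row[:] for row in w_bar]
w_minus = [row[:] for row in w_bar]
w_plus[0][1] += step
w_minus[0][1] -= step
numeric_w = (error(S, Z, w_plus, b_bar) - error(S, Z, w_minus, b_bar)) / (2 * step)

b_plus, b_minus = b_bar[:], b_bar[:]
b_plus[1] += step
b_minus[1] -= step
numeric_b = (error(S, Z, w_bar, b_plus) - error(S, Z, w_bar, b_minus)) / (2 * step)

print(grad_w[0][1], numeric_w)
print(grad_b[1], numeric_b)
```

The two numbers on each printed line should agree to several decimal places. If they don’t, the usual suspects are a flipped sign in the error term or a mixed-up index.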

Computing the Derivatives: Input Layer

We now turn to the task of computing the derivatives relating to the first, input layer. The method is similar, only now we need to apply the chain rule twice. Here is how it’s done. The first step is a rehash of the computations we did above:

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu, k} \left( \bar{s}_k^\mu - z_k^\mu \right) \frac{\partial \bar{s}_k^\mu}{\partial w_{ij}}

Proceeding as above,

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu, k} \left( \bar{s}_k^\mu - z_k^\mu \right) \bar{f}' \left( \bar{h}_k^\mu \right) \frac{\partial \bar{h}_k^\mu}{\partial w_{ij}}

We now re-apply our chain rule as follows: first of all, note that

\bar{h}_k^\mu = \sum_p \bar{w}_{kp} s_p^\mu + \bar{b}_k \\ s_p^\mu = f(h_p^\mu) = f \left( \sum_q w_{pq} x_q^\mu + b_p \right)

and that

\frac{\partial h_p^\mu }{\partial w_{ij}} =  \frac{\partial }{\partial w_{ij}}  \left( \sum_q w_{pq} x_q^\mu + b_p \right) =  \delta_{pi} x_j^\mu

and hence:

\frac{\partial \bar{h}_k^\mu}{\partial w_{ij}} = \sum_p \bar{w}_{kp} \frac{\partial s_p^\mu}{\partial w_{ij}} = \sum_p \bar{w}_{kp} f' \left( h_p^\mu \right) \frac{\partial h_p^\mu }{\partial w_{ij}} = \bar{w}_{ki} f' \left( h_i^\mu \right) x_j^\mu

Substituting this back into our main derivation, we obtain:

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu} \left[ \sum_k \bar{e}_k^\mu \bar{f}' \left( \bar{h}_k^\mu \right)  \bar{w}_{ki} \right] f' \left( h_i^\mu \right) x_j^\mu

Formally speaking, this looks exactly like our expression for \frac{\partial E}{\partial \bar{w}_{ij}} derived above provided we define the error in the first layer’s outputs as:

 e_i^\mu \equiv \sum_k \bar{e}_k^\mu \bar{f}' \left( \bar{h}_k^\mu \right)  \bar{w}_{ki}

which yields the compact expression (compare with the expression for \frac{\partial E}{\partial \bar{w}_{ij}} above):

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu} e_i^\mu f' \left( h_i^\mu \right) x_j^\mu

I will leave it up to you as an exercise (don’t you just hate it when someone says that? ;) ) to show that:

\frac{\partial E}{\partial b_j} = \sum_{\mu} e_j^\mu f' \left( h_j^\mu \right)
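The first-layer formulas can be checked numerically as well. Again, this is just a sketch under assumptions: both f and \bar{f} are taken to be the logistic sigmoid, the layer sizes and all numbers are invented, and the helper names (layer, first_layer_gradients) are mine. The point to watch is how the output error is pushed backwards through \bar{w} to form e_i^\mu before it is multiplied by f'(h_i^\mu).

```python
import math


def f(h):
    # Assumed activation for both layers: logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-h))


def f_prime(h):
    return f(h) * (1.0 - f(h))


def layer(inputs, weights, biases):
    """Return (h, f(h)) for one layer: h_p = sum_q w_pq * input_q + b_p."""
    h = [sum(w * x for w, x in zip(row, inputs)) + b
         for row, b in zip(weights, biases)]
    return h, [f(v) for v in h]


def error(X, Z, w, b, w_bar, b_bar):
    total = 0.0
    for x, z in zip(X, Z):
        _, s = layer(x, w, b)
        _, s_bar = layer(s, w_bar, b_bar)
        total += 0.5 * sum((z_k - s_k) ** 2 for s_k, z_k in zip(s_bar, z))
    return total


def first_layer_gradients(X, Z, w, b, w_bar, b_bar):
    """Analytic dE/dw and dE/db for the first layer."""
    grad_w = [[0.0] * len(row) for row in w]
    grad_b = [0.0] * len(b)
    for x, z in zip(X, Z):
        h, s = layer(x, w, b)
        h_bar, s_bar = layer(s, w_bar, b_bar)
        # Output error times slope: e_bar_k * f_bar'(h_bar_k)
        d_bar = [(s_bar[k] - z[k]) * f_prime(h_bar[k]) for k in range(len(z))]
        for i in range(len(b)):
            # Back-propagated error: e_i = sum_k e_bar_k f_bar'(h_bar_k) w_bar_ki
            e_i = sum(d_bar[k] * w_bar[k][i] for k in range(len(d_bar)))
            d = e_i * f_prime(h[i])
            grad_b[i] += d
            for j in range(len(x)):
                grad_w[i][j] += d * x[j]
    return grad_w, grad_b


# Made-up numbers: two patterns, N = 2 inputs, three hidden units, two outputs.
X = [[0.5, -1.0], [1.5, 0.3]]
Z = [[1.0, 0.0], [0.0, 1.0]]
w = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]]
b = [0.0, 0.1, -0.2]
w_bar = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
b_bar = [0.05, -0.1]

grad_w, grad_b = first_layer_gradients(X, Z, w, b, w_bar, b_bar)

# Central finite differences for one first-layer weight and one bias.
step = 1e-6
w_plus = [row[:] for row in w]
w_minus = [row[:] for row in w]
w_plus[1][0] += step
w_minus[1][0] -= step
numeric_w = (error(X, Z, w_plus, b, w_bar, b_bar)
             - error(X, Z, w_minus, b, w_bar, b_bar)) / (2 * step)

b_plus, b_minus = b[:], b[:]
b_plus[2] += step
b_minus[2] -= step
numeric_b = (error(X, Z, w, b_plus, w_bar, b_bar)
             - error(X, Z, w, b_minus, w_bar, b_bar)) / (2 * step)

print(grad_w[1][0], numeric_w)
print(grad_b[2], numeric_b)
```

As before, each printed pair should match closely. The inner sum that builds e_i is the step where the error travels from the output layer back to the first layer.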

Now What?

Having computed the derivatives, all we need to do is adjust our weights and biases according to some gradient descent method. In the next installment I will give a “cookbook recipe” for doing that - no need to turn any mental gears. After that we’ll discuss how to initialize the weights and biases in the first place, and conclude with a discussion of some MATLAB code that will put the concepts in this tutorial to use. Until next time - have fun!
