Even though it is possible to build an entire neural network from scratch using only the PyTorch
Tensor class, this is very tedious. And since most neural networks are built from the same building blocks, namely layers, it makes sense to generalize these layers into reusable components. That is exactly what PyTorch provides with its torch.nn package, which depends on autograd (as discussed in Part 2) to define the network models as well as to differentiate them (back-propagate). We usually define a new class for our model that extends nn.Module. Such a class contains the layers, as well as a forward(input) method that returns an output Variable.
The image below shows the network we will build in this part, which we can use to classify handwritten digits. We will use the popular MNIST dataset, which contains a training set of 60,000 labeled images and a test set of 10,000 labeled images.
It is a simple feed-forward convolutional neural network (CNN), which takes a 28 x 28 pixel greyscale input image, feeds it through several layers one after the other, and finally produces an output vector containing the log probability (since we will use the Negative Log Likelihood loss function) that the input was each of the digits 0 to 9. I will not explain concepts like convolution, pooling or dropout in this post. You can learn more about them here.
Training the network means that you have a dataset of matching input-output pairs. So if you give a handwritten digit of a 5 as input, you know what the expected output is: in this case a vector of zeros with a one at index 5 (this is also called one-hot encoding). A typical training procedure for a neural network is therefore as follows:
- Define the neural network which has some learnable parameters, often called weights.
- Iterate over the dataset of inputs (this can also be done in batches).
- Process the input through the network and calculate the output.
- Compute the loss (how far the calculated output differs from the correct output).
- Propagate the gradients back through the network.
- Update the weights of the network according to a simple update rule, such as:
weight = weight - learning_rate * gradient
Let’s look at how to implement each of these steps in PyTorch.
1. Define the network
The most convenient way of defining our network is by creating a new class which extends nn.Module. The Module class simply provides a convenient way of encapsulating the parameters, and includes some helper functions, such as moving the parameters to the GPU.
A network is usually defined in two parts. First we initialize all the components that we will use (these can be reused multiple times). Then, in the required forward() function, we “connect” our network together using the components defined in the initializer as well as any of the Tensor operations. We can also use all the activation functions, such as ReLU and Softmax, which are provided in the torch.nn.functional package. This will look as follows:
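The original snippet is not reproduced here; a minimal sketch, closely following the standard PyTorch MNIST example (the exact channel counts and kernel sizes are assumptions, not necessarily the article's), could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Initialize the reusable components: two conv layers, dropout, two linear layers
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 1 input channel (greyscale)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)                  # 20 maps of 4 x 4 after pooling
        self.fc2 = nn.Linear(50, 10)                   # one output per digit 0-9

    def forward(self, x):
        # Connect the components: conv -> max-pool -> ReLU, twice, then the linear layers
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)                            # flatten for the linear layers
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)                 # log probabilities for the NLL loss
```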
Note that since we are only using built-in functions we do not have to define the backward function (which is where the gradients are computed), since this is automatically determined by the autograd package. Also, since we will probably want to train our network using the GPU, we can achieve this by simply calling net.cuda(). (Note that there is also an alternative way the neural network can be defined, using PyTorch’s Sequential class. We build the same model using this in Part 3.5.)
The learnable parameters of the model are returned by net.parameters(), and for interest's sake you can view the size of each layer’s weights and retrieve the actual weight values for the kernels that are used (see the code snippet below). These weights are often visualized to gain some insight into how neural networks work.
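The snippet itself is not included here; a sketch of inspecting the parameters, using a single Conv2d layer as a stand-in for the full network, might look like this:

```python
import torch
import torch.nn as nn

# A single conv layer as a stand-in; any nn.Module, including our full network,
# exposes its parameters in exactly the same way
net = nn.Conv2d(1, 10, kernel_size=5)

params = list(net.parameters())
print(len(params))        # 2: the kernel weights and the biases
print(params[0].size())   # torch.Size([10, 1, 5, 5]): 10 kernels of 1 x 5 x 5
kernels = params[0].data  # the raw weight values, e.g. for visualization
```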
2. Iterate over dataset or inputs
The input to the network must be an autograd.Variable, as is the output. But first, how do we process our dataset in a simple way so that we can iterate over it?
Loading the data
PyTorch uses the DataLoader class to load datasets. It is a very versatile class, which can automatically divide our data into batches as well as shuffle it, among other things. It can be used to load supplied or custom datasets, which can be defined using the Dataset class. Since we will use a supplied dataset, we will not explain how to create custom datasets in this post. For a more detailed tutorial on how to do this, see this article.
Mini note on batching for PyTorch
torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample.
nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
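For example, a single 28 x 28 greyscale image can be given a fake batch dimension like this (a quick sketch):

```python
import torch

sample = torch.randn(1, 28, 28)  # a single image: nChannels x Height x Width
batch = sample.unsqueeze(0)      # now 1 x 1 x 28 x 28: a mini-batch of one sample
print(batch.size())
```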
Since the MNIST dataset is so commonly used, we can get the already processed dataset for free in torchvision, which should have been installed during Part 1 of this series. Using DataLoader, we can create our training and test datasets as follows:
Now we can iterate over the dataset by simply using a for-loop.
Iterating over the dataset
In order to run several epochs we will define a new function train() to run our training loop.
When we are training our network we need to set it to “training mode”; this effectively just means that we want the Dropout and BatchNorm layers to be active (we generally turn them off when running our test data). We do this by simply calling net.train(). Again, since we want all our data on the GPU to increase performance, we will convert all our Tensors to their GPU versions using data.cuda(). And finally, remember our network module requires the input to be of type Variable, so we simply cast our image and target to that type. The first part of our train() function will look as follows:
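The snippet is not included here; a sketch of this first part (with a trivial stand-in network and dataset so it runs on its own) might be:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Stand-ins so the snippet is self-contained; in the article, `net` is the CNN
# and `train_loader` the MNIST DataLoader defined earlier
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(8, 1, 28, 28),
                                   torch.randint(0, 10, (8,))),
    batch_size=4)

def train(epoch):
    net.train()  # "training mode": Dropout and BatchNorm behave as during training
    for batch_idx, (data, target) in enumerate(train_loader):
        if torch.cuda.is_available():          # move the batch to the GPU when present
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        # ... the forward pass, loss, backward pass and weight update follow here
```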
3. Process the input through the network and get the output
Since we have already done all of the difficult work in setting up the network, this is literally only one line, which we add inside the loop at the end of the last code snippet.
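The line itself is not shown here; as a sketch (with a stand-in network), the forward pass is simply:

```python
import torch
import torch.nn as nn

# Stand-in for the CNN defined earlier
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
data = torch.randn(4, 1, 28, 28)  # a batch of four fake 28 x 28 greyscale images

output = net(data)    # the single line added inside the training loop
print(output.size())  # one log probability per digit, for each sample
```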
Since we defined our network to use the Log Softmax at the end, the output will contain the log of the probability that the input was each of the digits from 0 to 9. The reason we use the Log Softmax is that we will use the Negative Log Likelihood loss function, which expects the Log Softmax as its input.
4. Compute the loss
A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.
When designing a neural network, you have a choice between several loss functions, some of which are better suited to certain tasks. Since we have a classification problem, either the Cross Entropy loss or the related Negative Log Likelihood (NLL) loss can be used. In this example we will use the NLL loss. In order to better understand NLL and Softmax, I highly recommend you have a look at this article.

But in summary: when training our model, we aim to find the minimum of our loss function, and at each step we update our parameters (weights) to ensure a lower loss in the future. The softmax output can be interpreted as the probability that the input belongs to each of the output classes, and this probability lies between 0 and 1. The log of such a value is negative, and its magnitude grows as the probability shrinks, which is the opposite of what we want to minimize, so we simply negate the answer; hence the Negative Log Likelihood. The internal formula for the loss is as follows:
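The formula image is missing here; for a vector of log probabilities $x$ and target class $y$, PyTorch's NLL loss for one sample, and its mean over a batch of $N$ samples, can be written as:

```latex
\ell(x, y) = -x_{y}
\qquad
L = -\frac{1}{N} \sum_{n=1}^{N} x_{n,\,y_n}
```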
In PyTorch there is a built-in NLL function, nll_loss, which expects its input in log form. That is why we calculate the Log Softmax, and not just the normal Softmax, in our network. Using it is as simple as adding one line to our training loop, providing the network output as well as the expected output.
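Sketched on its own (with random stand-in values), that line looks like this:

```python
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(4, 10), dim=1)  # stand-in for the network output
target = torch.tensor([3, 1, 4, 1])                   # the expected digit per sample

loss = F.nll_loss(log_probs, target)  # the single line added to the training loop
print(loss.item())
```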
Next we back-propagate!
5. Propagate the gradients back through the network
In this step we only calculate the gradients; we don’t use them yet. That happens in the next step. We would like to calculate the gradients of the loss with respect to the network’s parameters, and to do this we just leverage the power of PyTorch’s autograd and call the .backward() function on the loss variable. We do, however, first need to clear the existing gradients, otherwise the new gradients would be accumulated on top of the existing ones. For this we will use .zero_grad() before we call the .backward() function, but we will only show it in the next step. For now we simply add one more line to our training loop:
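As a tiny self-contained sketch (a single parameter standing in for the network's weights):

```python
import torch
import torch.nn.functional as F

w = torch.randn(1, 10, requires_grad=True)  # stand-in for a learnable parameter
loss = F.nll_loss(F.log_softmax(w, dim=1), torch.tensor([3]))

loss.backward()  # computes d(loss)/d(w); the gradients accumulate into w.grad
print(w.grad)    # which is why we must zero them before the next iteration
```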
Now that we have calculated the gradients, next we need to update the weights using a simple optimization strategy.
6. Update the weights of the network
When updating the weights there are several options we can use. The simplest (and rather effective) rule that is used in practice is called Stochastic Gradient Descent (SGD), which is calculated as follows:
weight = weight - learning_rate * gradient
SGD, as well as other update rules such as Nesterov-SGD, Adam and RMSProp, is built into a small package called torch.optim. When using SGD, we can also set the learning rate (how drastically the weights should be updated) as well as other parameters, such as momentum. We will define our optimizer directly after our model, as follows:
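The definition is not reproduced here; a sketch (the learning rate and momentum values are illustrative, not necessarily the article's) could be:

```python
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(28 * 28, 10)  # stand-in for the CNN defined earlier

# lr controls how drastically the weights are updated; momentum is optional
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5)
```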
Now within our train() function we must remember to zero our gradients before calling .backward(), as well as to tell our optimizer to “step”, meaning that the weights will be updated using the calculated gradients according to our rule. Since our entire training loop is now finished, here is the function in its entirety:
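The full listing is not reproduced here; a self-contained sketch (with stand-ins for the network, data and optimizer defined earlier) might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

# Stand-ins so the function runs on its own; in the article, `net` is the CNN,
# `train_loader` the MNIST DataLoader and `optimizer` the SGD optimizer from above
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(8, 1, 28, 28),
                                   torch.randint(0, 10, (8,))),
    batch_size=4)
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5)

def train(epoch):
    net.train()                            # enable training-mode behaviour
    for batch_idx, (data, target) in enumerate(train_loader):
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()              # clear the accumulated gradients
        output = net(data)                 # forward pass through the network
        loss = F.nll_loss(output, target)  # how far off were we?
        loss.backward()                    # back-propagate the gradients
        optimizer.step()                   # update the weights
```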
Testing our trained network
Now that we have finished training our model, we will probably also want to test how well it generalizes by applying it to our test dataset. For this we essentially copy our training function and modify it to set the model to “evaluation mode” using net.eval(), which turns Dropout and BatchNorm off. We would also like to accumulate the loss and print out the accuracy. Here is the code for the test() function (see the comments for more info):
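The listing is not included here; a self-contained sketch (again with stand-ins for the network and data) might be:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Stand-ins so the snippet runs on its own; in the article, `net` is the trained
# CNN and `test_loader` the MNIST test DataLoader defined earlier
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
test_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(8, 1, 28, 28),
                                   torch.randint(0, 10, (8,))),
    batch_size=4)

def test():
    net.eval()  # "evaluation mode": Dropout and BatchNorm are switched off
    test_loss, correct = 0.0, 0
    for data, target in test_loader:
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        output = net(data)
        # Sum (rather than average) the batch losses so we can divide once at the end
        test_loss += F.nll_loss(output, target, reduction='sum').item()
        pred = output.data.max(1)[1]  # index of the highest log probability
        correct += pred.eq(target.data).sum().item()
    test_loss /= len(test_loader.dataset)
    accuracy = 100.0 * correct / len(test_loader.dataset)
    print('Test loss: {:.4f}, accuracy: {:.1f}%'.format(test_loss, accuracy))
    return accuracy
```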
Rinse and repeat
Usually our model is not trained very well after only running it once. We would therefore want to train (and test) our model for several epochs. We will do this in our main function.
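Sketched with trivial stand-ins for the train() and test() functions defined earlier, the main loop is just:

```python
# Stand-ins; in the article these are the full train() and test() functions above
def train(epoch):
    print('training epoch', epoch)

def test():
    print('testing')

n_epochs = 3  # an illustrative number of passes over the training set
for epoch in range(1, n_epochs + 1):
    train(epoch)
    test()
```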
And that’s it! Now you know how to build, train and test a convolutional neural network! Well done! The full Python Script can be found here.
In this part of the series we learned how to build a convolutional neural network, and how to load data, looking specifically at the MNIST dataset. We also learned how to do the forward pass, calculate the loss, calculate the gradients, and update the weights using Stochastic Gradient Descent. Finally, we tested our network and ran it for several epochs. The full source, with some timing and loss plotting, can be found here.
We also implemented this same CNN using PyTorch’s Sequential class, which you can find in Part 3.5.
In the next few parts we will look at different neural net architectures, such as Recurrent Neural Networks (RNNs), their variant the Long Short-Term Memory network (LSTM), as well as Generative Adversarial Networks (GANs), and who knows where we will end up…