This is Part 3 of the tutorial series. Please also see the other parts (Part 1, Part 2, Part 3.5).
Even though it is possible to build an entire neural network from scratch using only the PyTorch Tensor class, this is very tedious. And since most neural networks are based on the same building blocks, namely layers, it makes sense to generalize these layers into reusable components. That is exactly what PyTorch provides with its torch.nn package.
The torch.nn package depends on autograd (as discussed in Part 2) to define the network models as well as to differentiate them (backpropagate). We usually define a new class for our model that extends nn.Module. An nn.Module contains layers, as well as a method forward(input) that returns a Variable which we will call output.
In the image below is the network we will build in this part, which we can use to classify handwritten digits. We will use the popular MNIST dataset, which contains a training set of 60,000 labeled images and a test set of 10,000 labeled images.
It is a simple feedforward convolutional neural network (CNN), which takes a 28 x 28 pixel greyscale input image, feeds it through several layers one after the other, and finally gives an output vector containing the log probability (since we will use the Negative Log Likelihood loss function) that the input was each of the digits 0 to 9. I will not explain concepts like convolution, pooling or dropout in this post. You can learn more about those here.
Training the network requires a dataset of matching input-output pairs. So if you give a handwritten digit of a 5 as an input, you know what the expected output is, in this case a vector of zeros with a one at index 5 (this is also called one-hot encoding). A typical training procedure for a neural network is therefore as follows:

 Define the neural network which has some learnable parameters, often called weights.
 Iterate over the dataset or inputs (could also be done as batches).
 Process the input through the network and calculate the output.
 Compute the loss (how far the calculated output differs from the correct output).
 Propagate the gradients back through the network.
 Update the weights of the network according to a simple update rule, such as:
weight = weight - learning_rate * gradient
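To make the update rule concrete, here is a toy sketch of it in plain Python, using a hand-written gradient. The quadratic function, starting weight and learning rate are just illustrative choices, not part of the network we will build:

```python
# Minimize f(w) = (w - 3)^2 by hand; its gradient is f'(w) = 2 * (w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

weight = 0.0
learning_rate = 0.1

# Repeatedly apply: weight = weight - learning_rate * gradient
for _ in range(100):
    weight = weight - learning_rate * gradient(weight)

print(weight)  # converges towards the minimum at w = 3
```

A neural network does exactly this, except the "weight" is millions of parameters and the gradient comes from backpropagation rather than a hand-derived formula.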
Let’s look at how to implement each of these steps in PyTorch.
1. Define the network
The most convenient way of defining our network is by creating a new class which extends nn.Module. The Module class provides a convenient way of encapsulating the parameters, and includes some helper functions, such as moving the parameters to the GPU.
A network is usually defined in two parts. First we initialize all the components that we will use (these can be reused multiple times). Then, in the required forward() function, we “connect” our network together using the components defined in the initializer as well as any of the Tensor operations. We can also use all the activation functions, such as ReLU and Softmax, which are provided in the torch.nn.functional package. This will look as follows:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # define all the components that will be used in the NN (these can be reused)
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.mp = nn.MaxPool2d(2, padding=0)
        self.drop2D = nn.Dropout2d(p=0.25, inplace=False)
        self.fc1 = nn.Linear(320, 100)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x):
        # define the actual network
        in_size = x.size(0)  # get the batch size
        # chain functions together to form the layers
        x = F.relu(self.mp(self.conv1(x)))
        x = F.relu(self.mp(self.conv2(x)))
        x = self.drop2D(x)
        x = x.view(in_size, -1)  # flatten data; -1 is inferred from the other dimensions
        x = self.fc1(x)
        x = self.fc2(x)
        return F.log_softmax(x)

net = Net()
net.cuda()  # makes the model run on the GPU
print(net)

# OUTPUT
# Net (
#   (conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
#   (conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
#   (mp): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
#   (drop2D): Dropout2d (p=0.25)
#   (fc1): Linear (320 -> 100)
#   (fc2): Linear (100 -> 10)
# )
Note that since we are only using built-in functions, we do not have to define the backward function (which is where the gradients are computed), since this is automatically handled by the autograd package. Also, since we would probably want to train our network on the GPU, we can achieve this by simply calling net.cuda(). (Note that there is also an alternative way to define the network, using PyTorch’s Sequential class. We build the same model that way in Part 3.5.)
The learnable parameters of the model are returned by net.parameters(), and for interest’s sake you can view the size of each layer’s weights, and retrieve the actual weight values for the kernels that are used (see the code snippet below). These weights are often visualized to gain some understanding of how neural networks work.
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's weights size
print(params[0][0, 0])   # conv1's weights for the first filter's kernel
2. Iterate over dataset or inputs
The input to the network must be an autograd.Variable, as is the output. But first, how do we process our dataset in a simple way so we can iterate over it?
Loading the data
PyTorch uses the DataLoader class to load datasets. It is a very versatile class, which can automatically divide our data into batches, as well as shuffle it, among other things. It can be used to load supplied or custom datasets, which can be defined using the Dataset class. Since we will use a supplied dataset, we will not explain how to create custom datasets in this post. For a more detailed tutorial on how to do this, see this article.
Mini note on batching for PyTorch
torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample. For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width. If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
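A quick sketch of that fake batch dimension (the tensor values here are just placeholders):

```python
import torch

# A single greyscale 28 x 28 image: nChannels x Height x Width
single_image = torch.zeros(1, 28, 28)

# torch.nn layers expect a mini-batch, so add a fake batch dimension in front:
batch = single_image.unsqueeze(0)

print(single_image.size())  # torch.Size([1, 28, 28])
print(batch.size())         # torch.Size([1, 1, 28, 28])
```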
Since the MNIST dataset is so commonly used, we can get the already processed dataset for free in torchvision, which should have been installed during Part 1 of this series. Using torchvision and DataLoader, we can create our training and test datasets as follows:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

batch_size = 100

train_dataset = datasets.MNIST(root='./data/', train=True,
                               transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root='./data/', train=False,
                              transform=transforms.ToTensor(), download=True)

# batch the data for the training and test datasets
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)

print(len(train_loader) * train_loader.batch_size, 'train samples')
print(len(test_loader) * test_loader.batch_size, 'test samples\n')
Now we can iterate over the dataset by simply using a for-loop.
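As a quick illustration of such a loop, here is a sketch using a small synthetic stand-in for MNIST (random tensors instead of real images, so nothing needs to be downloaded):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 random greyscale "images" with matching integer labels (stand-in for MNIST)
images = torch.randn(10, 1, 28, 28)
labels = torch.randint(0, 10, (10,))
loader = DataLoader(TensorDataset(images, labels), batch_size=5, shuffle=True)

# Each iteration yields one mini-batch of (data, target) pairs
for batch_idx, (data, target) in enumerate(loader):
    print(batch_idx, data.size(), target.size())
```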
Iterating over the dataset
In order to run several epochs we will define a new function train() to run our training loop.
When we are busy training our network we need to set it in “training mode”. This effectively just means that we would like the Dropout and BatchNorm layers to be active (we generally turn them off when running our test data). We do this by simply calling net.train(). Again, since we want all our data on the GPU to increase performance, we convert all our Tensors to their GPU version using data.cuda(). And finally, remember our network module requires the input to be of type Variable, so we simply cast our image and target to that type. The first part of our train() function will look as follows:
def train():
    net.train()  # set the model in "training mode"
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data.cuda()), Variable(target.cuda())
3. Process input data through the network. Get Output.
Since we have already done all of the difficult work in setting up the network, this is literally only one line, which we will add inside the loop at the end of the last code snippet.
output = net(data)
Since we defined our network to use the Log Softmax at the end, the output will contain the log of the probability that the input was each of the digits from 0 to 9. The reason we used the Log Softmax is that we will use the Negative Log Likelihood loss function, which expects the Log Softmax as input.
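To see what this output means in practice, here is a small sketch showing that exponentiating the Log Softmax output recovers ordinary probabilities that sum to 1 (note that recent PyTorch versions take an explicit dim argument; the raw scores are arbitrary illustrative values):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 2.0, 3.0]])  # raw network scores for one input
log_probs = F.log_softmax(scores, dim=1)  # what our network returns
probs = log_probs.exp()                   # back to ordinary probabilities

print(probs.sum().item())  # probabilities sum to 1
```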
4. Compute the loss
A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.
When designing a neural network, you have a choice between several loss functions, some of which are more suited to certain tasks. Since we have a classification problem, either the Cross Entropy loss or the related Negative Log Likelihood (NLL) loss can be used. In this example we will use the NLL loss. In order to better understand NLL and Softmax, I highly recommend you have a look at this article. But in summary: when training our model, we aim to find the minimum of our loss function, and at each step we update our parameters (weights) to ensure a lower loss in the future. The Softmax output can be interpreted as the probability that the input belongs to each of the output classes, and this probability is between 0 and 1. The log of such a probability is negative, and increases towards zero as the probability increases, which is the opposite of what we want from a loss, so we simply negate it, hence the Negative Log Likelihood. The internal formula for the loss of a single sample, where x is the vector of log probabilities and class is the correct class, is as follows:
loss(x, class) = -x[class]
In PyTorch there is a built-in NLL function in torch.nn.functional called nll_loss, which expects the output in log form. That is why we calculate the Log Softmax, and not just the normal Softmax, in our network. Using it is as simple as adding one line to our training loop, providing the network output as well as the expected output.
loss = F.nll_loss(output, target)
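To convince yourself of what nll_loss does, here is a small sketch comparing it to picking out (and negating) the correct class’s log probability by hand; the scores and target are arbitrary illustrative values:

```python
import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.tensor([[0.5, 1.5, 0.2]]), dim=1)
target = torch.tensor([1])  # the correct class is index 1

# NLL simply picks out (and negates) the log probability of the correct class:
by_hand = -log_probs[0, 1].item()
built_in = F.nll_loss(log_probs, target).item()

print(by_hand, built_in)  # the two values agree
```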
Next we backpropagate!
5. Propagate the gradients back through the network
In this step we only calculate the gradients; we don’t use them yet. That happens in the next step. We would like to calculate the gradients of the loss with respect to the network’s parameters, and to do this we just leverage the power of PyTorch’s autograd and call the .backward() function on the loss variable. We do, however, first need to clear the existing gradients, otherwise the new gradients will be accumulated onto the existing ones. For this we will use .zero_grad() before we call the .backward() function, but we will only show it in the next step. For now we simply add one more line to our training loop:
loss.backward()
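The accumulation behaviour mentioned above can be demonstrated on a tiny example (written with the current Tensor API, i.e. requires_grad instead of Variable, and on the CPU for simplicity):

```python
import torch

w = torch.ones(1, requires_grad=True)
loss = (w * 2).sum()  # gradient of this loss w.r.t. w is 2
loss.backward()
print(w.grad)  # tensor([2.])

# Calling backward on a fresh loss ACCUMULATES into w.grad:
loss2 = (w * 2).sum()
loss2.backward()
print(w.grad)  # tensor([4.]), not tensor([2.])

w.grad.zero_()  # this is what zeroing the gradients does for every parameter
print(w.grad)  # tensor([0.])
```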
Now that we have calculated the gradients, next we need to update the weights using a simple optimization strategy.
6. Update the weights of the network
When updating the weights there are several options we can use. The simplest (and rather effective) rule that is used in practice is called Stochastic Gradient Descent (SGD), which is calculated as follows:
weight = weight - learning_rate * gradient
SGD, as well as other update rules such as Nesterov-SGD, Adam and RMSProp, are built into a small package called torch.optim. When using SGD, we can also set the learning rate (how drastically the weights should be updated) as well as other parameters, such as momentum. We will define our optimizer directly after our model, as follows:
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
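To check that optimizer.step() really applies the simple rule above, here is a sketch with a single parameter and momentum left at its default of 0 (the values are arbitrary):

```python
import torch
import torch.optim as optim

w = torch.tensor([2.0], requires_grad=True)
opt = optim.SGD([w], lr=0.1)  # momentum defaults to 0

loss = (w ** 2).sum()  # gradient is 2 * w = 4
loss.backward()
opt.step()             # applies w <- w - 0.1 * 4, i.e. approximately 1.6

print(w.item())
```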
Now within our train() function we must remember to zero our gradients before calling .backward(), as well as tell our optimizer to “step”, meaning that the weights will be updated using the calculated gradients according to our rule. Since our entire training loop is finished now, here is the function in its entirety:
def train():
    net.train()  # set the model in "training mode"
    for batch_idx, (data, target) in enumerate(train_loader):
        # data.cuda() loads the data on the GPU, which increases performance
        data, target = Variable(data.cuda()), Variable(target.cuda())
        optimizer.zero_grad()  # necessary for a fresh sum of gradients
        output = net(data)  # call the forward() function (forward pass of network)
        loss = F.nll_loss(output, target)  # use negative log likelihood to determine loss
        loss.backward()  # backward pass of network (calculate sum of gradients for graph)
        optimizer.step()  # perform model parameter update (update weights)
Testing our trained network
Now that we have finished training our model, we will probably also want to test how well it generalizes by applying it to our test dataset. For this we essentially copy our training function and modify it to set the model in “evaluation mode” using net.eval(), which turns Dropout and BatchNorm off. We would also like to accumulate the loss and print out the accuracy. Here is the code for the test() function (see the comments for more info):
def test(epoch):
    net.eval()  # set the model in "testing mode"
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        # volatile=True, since the test data should not be used to train... cancels backpropagation
        data, target = Variable(data.cuda(), volatile=True), Variable(target.cuda())
        output = net(data)
        # size_average=False to sum, instead of average, the losses
        test_loss += F.nll_loss(output, target, size_average=False).data[0]
        pred = output.data.max(1, keepdim=True)[1]
        # to operate on the values they need to be on the CPU again
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()
    test_dat_len = len(test_loader.dataset)
    test_loss /= test_dat_len
    # print the test accuracy
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        test_loss, correct, test_dat_len, 100. * correct / test_dat_len))
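The prediction and accuracy lines can be illustrated in isolation with some made-up log probabilities (3 inputs, 4 classes, on the CPU for simplicity):

```python
import torch

# Fake log probabilities for a batch of 3 inputs over 4 classes:
output = torch.tensor([[-0.1, -2.0, -3.0, -4.0],
                       [-3.0, -0.2, -2.0, -4.0],
                       [-4.0, -3.0, -2.0, -0.3]])
target = torch.tensor([0, 1, 1])  # correct classes for the 3 inputs

pred = output.max(1, keepdim=True)[1]  # index of the highest log probability
correct = pred.eq(target.view_as(pred)).sum().item()

print(correct, '/', len(target))  # 2 / 3 predictions are correct
```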
Rinse and repeat
Usually our model is not trained very well after only a single pass over the data. We therefore want to train (and test) our model for several epochs. We will do this in our main function.
epochs = 10

if __name__ == '__main__':
    for epoch in range(1, epochs + 1):
        train()
        test(epoch)
And that’s it! Now you know how to build, train and test a convolutional neural network! Well done! The full Python Script can be found here.
Conclusion
In this part of the series we learned how to build a convolutional neural network and how to load data, looking specifically at the MNIST dataset. We also learned how to do the forward pass, calculate the loss, calculate the gradients, and update the weights using Stochastic Gradient Descent. Finally, we tested our network and ran it for several epochs. The full source, with some timing and loss plotting, can be found here.
We also implemented this same CNN using PyTorch’s Sequential class, which you can find in Part 3.5.
In the next few parts we will look at different neural net architectures, such as Recurrent Neural Networks (RNNs), their variation the Long Short-Term Memory network (LSTM), as well as Generative Adversarial Networks (GANs), and who knows where we will end up…