Getting started with PyTorch for Deep Learning (Part 3: Neural Network basics)

This is Part 3 of the tutorial series. Please also see the other parts (Part 1, Part 2, Part 3.5).

Even though it is possible to build an entire neural network from scratch using only the PyTorch Tensor class, this is very tedious. And since most neural networks are based on the same building blocks, namely layers, it would make sense to generalize these layers as reusable functions. That is exactly what PyTorch provides with its torch.nn package.

The torch.nn package depends on autograd (as discussed in Part 2) to define the network models as well as to differentiate them (back-propagate). We usually define a new class for our model that extends the nn.Module class. An nn.Module contains layers, as well as a method forward(input) that returns a Variable, which we will call the output.

The image below shows the network we will build in this part, which we can use to classify handwritten digits. We will use the popular MNIST dataset, which contains a training set of 60,000 labeled images and a test set of 10,000 labeled images.


It is a simple feed-forward convolutional neural network (CNN) that takes a 28 x 28 pixel greyscale input image, feeds it through several layers one after the other, and finally produces an output vector containing the log probabilities (since we will use the Negative Log Likelihood loss function) that the input was each of the digits 0 to 9. I will not explain concepts like convolution, pooling or dropout in this post. You can learn more about that here.

Training the network requires a dataset of matching input-output pairs. So if you give a handwritten 5 as an input, you know what the expected output is: in this case a vector of zeros with a one at index 5 (this is also called one-hot encoding). A typical training procedure for a neural network is therefore as follows:

    1. Define the neural network which has some learnable parameters, often called weights.
    2. Iterate over the dataset or inputs (could also be done as batches).
    3. Process the input through the network and calculate the output.
    4. Compute the loss (how far the calculated output differs from the correct output).
    5. Propagate the gradients back through the network.
    6. Update the weights of the network according to a simple update rule, such as:
      weight = weight - learning_rate * gradient
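
As a rough sketch of these six steps, independent of PyTorch, the loop below fits a single weight so that weight * x matches the targets. The data, the squared-error loss, and the names fit_weight and learning_rate are assumptions made purely for illustration. The gradient is derived by hand here, which is the part that autograd automates.

```python
def fit_weight(samples, learning_rate=0.05, epochs=50):
    """Fit y = weight * x with plain gradient descent (illustrative only)."""
    weight = 0.0                                  # step 1: one learnable parameter
    for _ in range(epochs):
        for x, target in samples:                 # step 2: iterate over the dataset
            output = weight * x                   # step 3: forward pass
            loss = (output - target) ** 2         # step 4: squared-error loss (shown for clarity)
            gradient = 2 * (output - target) * x  # step 5: d(loss)/d(weight), derived by hand
            weight = weight - learning_rate * gradient  # step 6: the update rule
    return weight


# Hypothetical data generated from y = 2 * x
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned = fit_weight(samples)
print(round(learned, 3))
```

The learned weight ends up very close to 2, the value used to generate the toy data.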

Let’s look at how to implement each of these steps in PyTorch.

1. Define the network

The most convenient way of defining our network is by creating a new class which extends nn.Module. The Module class simply provides a convenient way of encapsulating the parameters, and includes some helper functions, such as moving the parameters to the GPU.

A network is usually defined in two parts. First we initialize all the components that we will use (these can be reused multiple times). Then, in the required forward() function, we “connect” our network together using the components defined in the __init__ function as well as any of the Tensor operations. We can also use all the activation functions, such as ReLU and Softmax, which are provided in the torch.nn.functional package. This will look as follows:

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # define all the components that will be used in the NN (these can be reused)
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.mp = nn.MaxPool2d(2, padding=0)
        self.drop2D = nn.Dropout2d(p=0.25, inplace=False)
        self.fc1 = nn.Linear(320,100)
        self.fc2 = nn.Linear(100,10)

    def forward(self, x):
        # define the actual network
        in_size = x.size(0) # get the batch size
        # chain functions together to form the layers
        x = F.relu(self.mp(self.conv1(x))) # conv1 -> max pool -> ReLU
        x = F.relu(self.mp(self.conv2(x))) # conv2 -> max pool -> ReLU
        x = self.drop2D(x)
        x = x.view(in_size, -1) # flatten data, -1 is inferred from the other dimensions
        x = self.fc1(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1) # log probabilities over the 10 classes

net = Net()
net.cuda() #makes model run on GPU

#Net (
#  (conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
#  (conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
#  (mp): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
#  (drop2D): Dropout2d (p=0.25)
#  (fc1): Linear (320 -> 100)
#  (fc2): Linear (100 -> 10)
#)

Note that since we are only using built-in functions, we do not have to define the backward function (which is where the gradients are computed), since it is determined automatically by the autograd package. Also, since we would probably want to train our network on the GPU, we can achieve this by simply calling net.cuda(). (Note that there is also an alternative way to define the neural network, using PyTorch’s Sequential class. We built the same model this way in Part 3.5.)
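
The 320 input features of fc1 are not arbitrary. The sketch below traces the image size through the layers, assuming each convolution is followed by the 2 x 2 max pooling layer (as the printed model suggests) and using the standard output-size formula for convolution and pooling windows. The helper name conv_out is hypothetical.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling window."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28                                          # MNIST images are 28 x 28
size = conv_out(size, kernel_size=5)               # conv1    -> 24
size = conv_out(size, kernel_size=2, stride=2)     # max pool -> 12
size = conv_out(size, kernel_size=5)               # conv2    -> 8
size = conv_out(size, kernel_size=2, stride=2)     # max pool -> 4

flat_features = 20 * size * size                   # conv2 produces 20 channels
print(flat_features)  # 320
```

If you change a kernel size or remove a pooling step, this number changes and the first argument of fc1 has to be adjusted to match.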

The learnable parameters of the model are returned by net.parameters(). For interest’s sake, you can view the size of each layer’s weights and retrieve the actual weight values of the kernels that are used (see the code snippet below). These weights are often visualized to gain some understanding of how neural networks work.

params = list(net.parameters())
print(params[0].size())  # conv1's weights size
print(params[0][0,0])  # conv1's weights for the first filter's kernel
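
To get a feeling for how many learnable values these parameter tensors hold, the following sketch counts them by hand from the layer sizes used in the network. It assumes every layer has a bias term, which is the default for nn.Conv2d and nn.Linear. The helper names are hypothetical.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Kernel weights plus one bias per output channel."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


counts = {
    "conv1": conv2d_params(1, 10, 5),     # 260
    "conv2": conv2d_params(10, 20, 5),    # 5020
    "fc1": linear_params(320, 100),       # 32100
    "fc2": linear_params(100, 10),        # 1010
}
total = sum(counts.values())
print(counts, total)
```

Most of the 38,390 parameters sit in the first fully connected layer, not in the convolutions.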

2. Iterate over dataset or inputs

The input to the network must be an autograd.Variable, as is the output. But first, how do we process our dataset in a simple way to iterate over it?

Loading the data

PyTorch uses the DataLoader class to load datasets. It is a very versatile class, which can, among other things, automatically divide our data into batches and shuffle it. It can be used to load supplied or custom datasets, which can be defined using the Dataset class. Since we will use a supplied dataset, we will not explain how to create custom datasets in this post. For a more detailed tutorial on how to do this, see this article.

Mini note on batching for PyTorch

torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample.

For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.

If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
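
A minimal sketch of adding that fake batch dimension, assuming a recent PyTorch release in which plain tensors can be used directly. The all-zero image is only a stand-in for a real sample.

```python
import torch

sample = torch.zeros(1, 28, 28)      # one greyscale image: channels x height x width
batch = sample.unsqueeze(0)          # add a batch dimension of size 1 at the front

print(sample.shape)  # torch.Size([1, 28, 28])
print(batch.shape)   # torch.Size([1, 1, 28, 28])
```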

Since the MNIST dataset is so commonly used, we can get the already processed dataset for free in torchvision, which should have been installed during Part 1 of this series. Using torchvision and DataLoader, we can create our training and test dataset as follows:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

batch_size = 100

train_dataset = datasets.MNIST(root='./data/', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root='./data/', train=False, transform=transforms.ToTensor(), download=True)

# batch the data for the training and test datasets
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)

print(len(train_loader.dataset), 'train samples')
print(len(test_loader.dataset), 'test samples\n')

Now we can iterate over the dataset by simply using a for-loop.
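
To make the batching and shuffling concrete, here is a conceptual sketch in plain Python of what a loader does when you iterate over it. It is not how DataLoader is implemented. The function iterate_batches and the toy dataset are hypothetical.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=True, seed=0):
    """Yield lists of samples, roughly what a data loader does conceptually."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)      # new order, same samples
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


toy_dataset = list(range(10))                     # stand-in for 10 samples
batches = list(iterate_batches(toy_dataset, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Note that the last batch is smaller when the dataset size is not a multiple of the batch size. With 60,000 MNIST training images and a batch size of 100 this does not occur.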

Iterating over the dataset

In order to run several epochs we will define a new function train() to run our training loop.

While we are training our network we need to set it to “training mode”, which effectively only means that we would like the Dropout and BatchNorm layers to be active (we generally turn them off when running our test data). We do this by simply calling net.train(). Again, since we want all our data on the GPU to increase performance, we convert all our Tensors to their GPU versions using data.cuda(). And finally, remember that our network module requires the input to be of type Variable, so we simply wrap our image and target in that type. The first part of our train() function will look as follows:

def train():
    net.train() # set the model in "training mode"

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data.cuda()), Variable(target.cuda())

3. Process input data through the network. Get Output.

Since we have already done all of the difficult work in setting up the network, this is literally only one line, which we will add to the loop at the end of the last code snippet.

output = net(data)

Since we defined our network to use the Log Softmax at the end, the output will contain the log of the probability that the input was each of the digits from 0 to 9. The reason we used the Log Softmax is that we will use the Negative Log Likelihood loss function, which expects the Log Softmax as input.
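
For intuition, the Log Softmax of a list of raw scores can be sketched in plain Python as follows. This is an illustrative version, not PyTorch's implementation. The scores are made up, and subtracting the largest score first is a common trick to avoid overflow in exp.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    highest = max(scores)
    log_sum = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_sum for s in scores]


scores = [1.0, 2.0, 3.0]                 # hypothetical raw outputs for 3 classes
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)
print(sum(probs))  # approximately 1.0
```

Every log probability is negative, and exponentiating them recovers probabilities that sum to one.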

4. Compute the loss

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.

When designing a neural network, you have a choice between several loss functions, some of which are more suited to certain tasks. Since we have a classification problem, either the Cross Entropy loss or the related Negative Log Likelihood (NLL) loss can be used. In this example we will use the NLL loss. In order to better understand NLL and Softmax, I highly recommend you have a look at this article. But in summary, when training our model, we aim to find a minimum of our loss function, and at each step we update our parameters (weights) to obtain a lower loss in the future. The softmax can be interpreted as the probability that the input belongs to each of the output classes, and this probability is between 0 and 1. Taking the log of that value gives a negative number that increases towards 0 as the probability of the correct class approaches 1. We want a quantity that decreases as the prediction improves, so we simply negate it, hence the Negative Log Likelihood. For a single sample, the loss is therefore the negated log probability that the network assigned to the correct class.

In PyTorch there is a built-in NLL function in torch.nn.functional called nll_loss, which expects the output in log form. That is why we calculate the Log Softmax, and not just the normal Softmax, in our network. Using it is as simple as adding one line to our training loop, providing the network output as well as the expected output.

loss = F.nll_loss(output, target)
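
What this line computes can be sketched in plain Python: pick the log probability of the correct class for each sample, negate it, and average over the batch. The function below is a hypothetical stand-in with made-up probabilities, assuming the default behaviour of averaging over the batch.

```python
import math


def nll_loss(log_probs_batch, targets):
    """Mean negative log likelihood for a batch of log-probability rows."""
    picked = [row[t] for row, t in zip(log_probs_batch, targets)]
    return -sum(picked) / len(picked)


# Hypothetical log-probabilities for two samples over three classes
confident = [math.log(0.8), math.log(0.1), math.log(0.1)]
unsure = [math.log(0.4), math.log(0.3), math.log(0.3)]

print(nll_loss([confident, unsure], targets=[0, 0]))   # correct class is the most likely one
print(nll_loss([confident, unsure], targets=[1, 1]))   # correct class is unlikely: larger loss
```

Picking row[t] is the same as multiplying the row by the one-hot encoded target and summing, which is why the targets can be passed as plain class indices.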

Next we back-propagate!

5. Propagate the gradients back through the network

In this step we only calculate the gradients, but we don’t use them yet. That happens in the next step. We would like to calculate the gradients of the loss with respect to the network’s parameters, so to do this we just leverage the power of PyTorch’s autograd and call the .backward() function on the loss variable. However, we first need to clear the existing gradients, otherwise the new gradients will be accumulated onto the existing ones. For this we will call .zero_grad() before we call .backward(), but we will only show it in the next step. For now we simply add one more line to our training loop:

loss.backward()

Now that we have calculated the gradients, next we need to update the weights using a simple optimization strategy.
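
The accumulation behaviour mentioned above can be seen in a tiny experiment. This is a sketch assuming a recent PyTorch release in which tensors carry requires_grad directly. The single scalar w and the toy loss are made up for illustration.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

loss = w * w                  # d(loss)/dw = 2 * w = 6
loss.backward()
first = w.grad.item()         # 6.0

loss = w * w                  # same computation again, gradients not cleared
loss.backward()
accumulated = w.grad.item()   # 12.0, the new gradient was added to the old one

w.grad.zero_()                # reset in place; zero_grad() serves this purpose for all parameters
cleared = w.grad.item()       # 0.0

print(first, accumulated, cleared)
```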

6. Update the weights of the network

When updating the weights there are several options we can use. The simplest (and rather effective) rule that is used in practice is called Stochastic Gradient Descent (SGD), which is calculated as follows:
weight = weight - learning_rate * gradient

SGD, as well as other update rules such as Nesterov-SGD, Adam and RMSProp, is built into a small package called torch.optim. When using SGD, we can also set the learning rate (how drastically the weights should be updated) as well as other parameters, such as momentum. We will define our optimizer directly after our model, as follows:

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
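
To see what the momentum argument adds to the plain update rule, here is a sketch for a single scalar weight. To the best of my knowledge it follows the form of the update used by optim.SGD with its other options left at their defaults, but treat it as an illustration, not the library code. The function name and the toy loss are hypothetical.

```python
def sgd_momentum_step(weight, gradient, velocity, learning_rate=0.01, momentum=0.9):
    """One update: the velocity accumulates past gradients, then moves the weight."""
    velocity = momentum * velocity + gradient
    weight = weight - learning_rate * velocity
    return weight, velocity


# Minimise the toy loss (weight - 5) ** 2, whose gradient is 2 * (weight - 5)
weight, velocity = 0.0, 0.0
for _ in range(500):
    gradient = 2 * (weight - 5.0)
    weight, velocity = sgd_momentum_step(weight, gradient, velocity)

print(round(weight, 4))
```

With momentum set to 0 the step reduces to weight = weight - learning_rate * gradient, the simple rule shown earlier.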

Now within our train() function we must remember to zero our gradients before calling .backward() as well as tell our optimizer to “step”, meaning that the weights will be updated using the calculated gradients according to our rule. Since our entire training loop is finished now, here is the function in its entirety:

def train():
    net.train() # set the model in "training mode"

    for batch_idx, (data, target) in enumerate(train_loader):
        # data.cuda() loads the data on the GPU, which increases performance
        data, target = Variable(data.cuda()), Variable(target.cuda())

        optimizer.zero_grad()   # necessary for new sum of gradients
        output = net(data)      # call the forward() function (forward pass of network)
        loss = F.nll_loss(output, target) # use negative log likelihood to determine loss
        loss.backward()         # backward pass of network (calculate sum of gradients for graph)
        optimizer.step()        # perform model parameter update (update weights)

Testing our trained network

Now that we have finished training our model, we will probably also want to test how well it generalizes by applying it to our test dataset. For this we essentially copy our training function and just modify it to set the model in “evaluation mode” using net.eval(), which will turn Dropout and BatchNorm off. We would also like to accumulate the loss and print out the accuracy. Here is the code for the test() function (see the comments for more info):

def test(epoch):
    net.eval()  # set the model in "testing mode"
    test_loss = 0
    correct = 0

    for data, target in test_loader:
        data, target = Variable(data.cuda(), volatile=True), Variable(target.cuda()) # volatile=True, since the test data should not be used to train (no backpropagation needed)
        output = net(data)
        test_loss += F.nll_loss(output, target, size_average=False).data[0] # size_average=False to sum, instead of average, the losses
        pred = output.data.max(1, keepdim=True)[1] # index of the highest log probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum() # to operate on variables they need to be on the CPU again
    test_dat_len = len(test_loader.dataset)
    test_loss /= test_dat_len
    # print the test accuracy
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        test_loss, correct, test_dat_len, 100. * correct / test_dat_len))
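
The accuracy part of this function boils down to an argmax per sample followed by a comparison with the target. A plain-Python sketch of that idea follows. The function name and the scores are hypothetical.

```python
def accuracy(scores_batch, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    correct = 0
    for row, target in zip(scores_batch, targets):
        predicted = max(range(len(row)), key=lambda i: row[i])   # argmax
        correct += int(predicted == target)
    return correct / len(targets)


# Hypothetical outputs for three samples over three classes
outputs = [
    [-0.2, -2.0, -3.0],   # predicts class 0
    [-1.5, -0.4, -2.5],   # predicts class 1
    [-0.9, -1.1, -1.4],   # predicts class 0
]
print(accuracy(outputs, targets=[0, 1, 2]))  # 2 of 3 correct
```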

Rinse and repeat

Usually our model is not trained very well after only running it once. We would therefore want to train (and test) our model for several epochs. We will do this in our main function.

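epochs = 10
if __name__ == '__main__':
    for epoch in range(1, epochs + 1):
        train()       # one pass over the training data
        test(epoch)   # evaluate on the test data after each epoch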

And that’s it! Now you know how to build, train and test a convolutional neural network! Well done! The full Python Script can be found here.


In this part of the series we learned how to build a convolutional neural network and how to load data, looking specifically at the MNIST dataset. We also learned how to do the forward pass, calculate the loss, calculate the gradients, and update the weights using Stochastic Gradient Descent. Finally, we tested our network and ran it for several epochs. The full source, with some timing and loss plotting, can be found here.

We also implemented this same CNN using PyTorch’s Sequential class, which you can find in Part 3.5.

In the next few parts we will look at different neural net architectures, such as Recurrent Neural Networks (RNNs), their variation the Long Short Term Memory Network (LSTM), as well as Generative Adversarial Networks (GANs), and who knows where we will end up…
