Introduction to Neural Networks — Part 2
Now that we have seen how a neural network is represented, we can go on to see how exactly it works. Since there are many layers with many neurons each, a complex set of weights turns the input variables into an output. Each weight in this network can be changed, and hence there are countless configurations a neural network can have. A trained neural network has a weight configuration that accurately predicts the correct outputs from the input data, and that is what we hope to achieve. We will now go through how exactly a neural network trains itself to reach this desirable weight configuration.
Backpropagation is the name of the algorithm a neural network uses to train itself. This revolutionary algorithm combines the chain rule from differentiation with gradient descent, a common optimization algorithm that is also used in linear and logistic regression.
To understand how backpropagation works, we first have to understand the relationship between the output and the weights in between. Because of the way the neural network is connected, every weight in it affects the output in some way. This means that if we change a particular weight, the output will change in some way. We can also find the exact mathematical equation defining the relationship between each weight and the output.
Let’s say we are given a dataset with three input features (X1, X2 and X3), and we need to find the relationship between the input features and the output using a neural network. We will now see what exactly the neural network does.
The first step is to feed in the data and get the output from the neural network. We shall call this output Y_pred. Next, we compare the predicted value with the actual value. The gap between the two is called the error. It is essentially a measure of how far off the model is from predicting the values correctly. Our goal now is to minimize the error.
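As a minimal sketch of this first step, the snippet below feeds one row of data through a tiny network and measures the error. Everything specific here is an assumption made for illustration: the two hidden neurons, the sigmoid activation, the weight values, and the use of the squared difference as the error for a single row.

```python
import math


def sigmoid(z):
    """Squash a number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, output_weights):
    """Return Y_pred for a 3-input network with one hidden layer."""
    # Each hidden neuron takes a weighted sum of the inputs, then applies sigmoid.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    # The output neuron is a plain weighted sum of the hidden values.
    return sum(w * h for w, h in zip(output_weights, hidden))


# Hypothetical values, chosen only for illustration.
x = [0.5, -1.0, 2.0]                                   # X1, X2, X3
hidden_weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]]   # one row per hidden neuron
output_weights = [0.7, -0.4]
y_actual = 1.0

y_pred = forward(x, hidden_weights, output_weights)
error = (y_actual - y_pred) ** 2   # assumed error: squared difference for one row
print("Y_pred:", y_pred, "error:", error)
```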
Since Y_pred is a function of all the weights in the model and Error is a function of Y_pred, we can say that the Error will also depend on the weights. This means that we need to adjust our weights in such a way that the error is minimized. We do that using partial derivatives.
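Written out, this dependency is the chain rule applied to a single weight W. The second line is only an example and assumes the error for one row is the squared difference E = (Y - Y_pred)^2, which is one common choice and not the only one.

```latex
\frac{\partial E}{\partial W}
  = \frac{\partial E}{\partial Y_{pred}} \cdot \frac{\partial Y_{pred}}{\partial W}

% Assuming E = (Y - Y_{pred})^2:
\frac{\partial E}{\partial Y_{pred}} = -2\,(Y - Y_{pred})
```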
Let’s take a simple example of an equation, Y = W1*X1 + W2*X2. If we find the partial derivative of Y with respect to W1 or W2, we can find out how a change in W1 or W2 affects Y.
If the partial derivative of Y with respect to W1 is POSITIVE, that means DECREASING the weight will DECREASE Y.
If the partial derivative of Y with respect to W1 is NEGATIVE, that means INCREASING the weight will DECREASE Y.
These two rules are all we need to optimize the weights in the neural network. We need to find the partial derivative of the error with respect to every weight. If that partial derivative is positive, we decrease the value of that weight so that the error decreases. If that partial derivative is negative, we increase that weight so that the error decreases. This is the basic underlying concept of how weights are updated after we calculate the error.
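Here is a small sketch of these two rules in action on the equation Y = W1*X1 + W2*X2. It assumes the error is the squared difference (Y - Y_pred)^2 and introduces a learning rate that controls the size of each step; the function names and all numbers are hypothetical.

```python
def squared_error(w1, w2, x1, x2, y_actual):
    """Error for one row, assuming E = (Y - Y_pred)^2."""
    return (y_actual - (w1 * x1 + w2 * x2)) ** 2


def gradient_step(w1, w2, x1, x2, y_actual, learning_rate=0.05):
    """Move each weight against the sign of its partial derivative."""
    y_pred = w1 * x1 + w2 * x2
    d_error_d_pred = -2.0 * (y_actual - y_pred)
    grad_w1 = d_error_d_pred * x1   # partial derivative of E with respect to W1
    grad_w2 = d_error_d_pred * x2   # partial derivative of E with respect to W2
    # Subtracting the gradient decreases a weight whose derivative is positive
    # and increases a weight whose derivative is negative.
    return w1 - learning_rate * grad_w1, w2 - learning_rate * grad_w2


# Hypothetical starting weights and a single row of data.
w1, w2 = 0.5, -0.5
x1, x2, y_actual = 1.0, 2.0, 3.0

error_before = squared_error(w1, w2, x1, x2, y_actual)
new_w1, new_w2 = gradient_step(w1, w2, x1, x2, y_actual)
error_after = squared_error(new_w1, new_w2, x1, x2, y_actual)
print(error_before, "->", error_after)
```

Both derivatives are negative for these values, so both weights are increased and the error drops.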
Since the last layer is the closest to the error, we first differentiate the error with respect to the weights of the last layer and update those weights. Then we move to the second-last layer and do the same, and so on. We repeat this process until we reach the first layer and all the weights are updated. This entire process is called backpropagation.
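The layer-by-layer idea can be sketched for a tiny network with one hidden layer. This is an illustration under assumptions, not a full implementation: the sigmoid hidden layer, the single linear output neuron, the squared error, the learning rate and all values are hypothetical. For simplicity, the sketch computes the derivatives for both layers first and then applies all the updates together.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden layer values and Y_pred."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y_pred = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, y_pred


def backprop(x, y_actual, w_hidden, w_out):
    """Return the partial derivatives of E = (Y - Y_pred)^2 for both layers."""
    hidden, y_pred = forward(x, w_hidden, w_out)
    d_error_d_pred = -2.0 * (y_actual - y_pred)
    # Last layer first: Y_pred depends on w_out[j] through hidden[j].
    grad_out = [d_error_d_pred * h for h in hidden]
    # Then one layer back: chain through w_out[j] and the sigmoid slope h * (1 - h).
    grad_hidden = []
    for j, h in enumerate(hidden):
        d_error_d_z = d_error_d_pred * w_out[j] * h * (1.0 - h)
        grad_hidden.append([d_error_d_z * xi for xi in x])
    return grad_hidden, grad_out


def update(w_hidden, w_out, grad_hidden, grad_out, learning_rate=0.1):
    """Step every weight against its partial derivative."""
    new_hidden = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
                  for w_row, g_row in zip(w_hidden, grad_hidden)]
    new_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


# Hypothetical row of data and starting weights.
x, y_actual = [0.5, -1.0, 2.0], 1.0
w_hidden = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]]
w_out = [0.7, -0.4]

error_before = (y_actual - forward(x, w_hidden, w_out)[1]) ** 2
grad_hidden, grad_out = backprop(x, y_actual, w_hidden, w_out)
new_w_hidden, new_w_out = update(w_hidden, w_out, grad_hidden, grad_out)
error_after = (y_actual - forward(x, new_w_hidden, new_w_out)[1]) ** 2
print(error_before, "->", error_after)
```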
We perform backpropagation for a single row of data and update the weights. We then repeat this for every row in the training data set. This entire cycle is called one epoch. Neural networks usually take several epochs to train, and it is up to us to decide how many epochs to train for.
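To make the idea of an epoch concrete, here is a sketch of a training loop that updates the weights after every row and repeats for a chosen number of epochs. To keep it short it trains the simple equation Y = W1*X1 + W2*X2 instead of a full network, and the data set, learning rate and epoch count are all hypothetical.

```python
def train(rows, epochs=50, learning_rate=0.05):
    """Train W1 and W2 row by row; return the weights and the error per epoch."""
    w1, w2 = 0.0, 0.0
    history = []
    for _ in range(epochs):                 # one pass over all rows = one epoch
        total_error = 0.0
        for x1, x2, y_actual in rows:       # one weight update per row
            y_pred = w1 * x1 + w2 * x2
            diff = y_actual - y_pred
            total_error += diff ** 2
            # The derivative of diff^2 with respect to W1 is -2 * diff * x1,
            # so stepping against it means adding 2 * diff * x1.
            w1 += learning_rate * 2.0 * diff * x1
            w2 += learning_rate * 2.0 * diff * x2
        history.append(total_error / len(rows))
    return (w1, w2), history


# Hypothetical data generated from Y = 2*X1 - 1*X2.
rows = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0), (1.0, 1.0, 1.0), (2.0, 1.0, 3.0)]

(w1, w2), history = train(rows)
print("weights:", w1, w2)
print("first epoch error:", history[0], "last epoch error:", history[-1])
```

Training for more epochs brings the weights closer to the values used to generate the data, which is why the number of epochs is a choice we have to make.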
Error Calculation Methods
Error is a very important part of a neural network because it allows us to estimate how poorly the model is performing and accordingly we can update our weights to improve performance. Now let’s go through how exactly error can be calculated in neural networks. The type of error calculation method we choose will depend upon the type of task we are trying to do. There are mainly two types of tasks a typical neural network can do:
Regression is when our output variable is continuous in nature. When Y is a numerical variable which we have to predict, the task is called regression. Examples include trying to predict house prices or trying to predict how many marks someone will score in an exam.
Classification is when our output variable is discrete in nature. When Y is trying to represent a certain class out of a defined number of classes, the task is called classification. Examples include trying to predict between a cat and a dog, or trying to predict whether an email is spam or not.
In the regression graph, the output variable is a continuous value, just like X. In the classification graph, there are two input variables (X1 and X2) and the output is represented by the color of the points, e.g. Y = 0 for RED and Y = 1 for GREEN.
Let’s first cover regression.
For regression, the error function we use is Mean Squared Error (MSE). We simply find the difference between the actual value and the predicted value, and square it. The formula is given below:
$$E = \frac{1}{N} \sum (Y - Y_{pred})^2$$
Here, E is the error of the model, Y_pred is the output of the model for the data, and Y is the actual output we are supposed to get. We square the difference so that negative and positive differences both become positive in the end. N represents the number of data rows we input into the model. We are essentially calculating the error of every point and averaging them. As we train the model, we hope to achieve a very low MSE, which tells us that our neural network is predicting values close to the actual values.
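The calculation described above can be sketched in a few lines. The function name and the sample values are hypothetical.

```python
def mean_squared_error(y_actual, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_actual) != len(y_pred) or not y_actual:
        raise ValueError("both lists must be non-empty and of equal length")
    n = len(y_actual)
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_pred)) / n


# Hypothetical actual and predicted values for three rows of data.
y_actual = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

# Squared differences are 1, 0 and 4, so the MSE is 5 / 3.
mse = mean_squared_error(y_actual, y_pred)
print(mse)
```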
Moving on to classification.
In classification, the way we represent data is completely different. Because we are dealing with multiple classes which may not be in a numerical format, we have to convert them into a numerical format so that the neural network can process them. We typically convert them using a method called one-hot encoding.
Let’s take the example of three cities as classes: San Francisco, New York and Boston. We have to predict which one a row of data belongs to. We will then have an array of size 3 representing Y for each row of data. For example, San Francisco = [1, 0, 0], New York = [0, 1, 0] and Boston = [0, 0, 1].
The array size will indicate the total number of classes and the position of 1 in the array will represent the class.
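A minimal sketch of one-hot encoding for the three cities is shown below. The function name and the order of the classes are assumptions; any fixed order works as long as it is used consistently.

```python
def one_hot(label, classes):
    """Return an array with a 1 at the position of the label and 0 elsewhere."""
    vector = [0] * len(classes)
    vector[classes.index(label)] = 1   # raises ValueError for an unknown label
    return vector


# Assumed class order for this example.
classes = ["San Francisco", "New York", "Boston"]

encoded = [one_hot(city, classes) for city in ["Boston", "San Francisco"]]
print(encoded)
```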
Before we calculate the error in this scenario, we also have to ensure that the last layer of the neural network has a number of neurons equal to the number of classes. The last layer should also use softmax as its activation function. This activation function essentially converts the values of the last layer into a probability distribution. Hence the sum of all the values of the nodes in the last layer will be equal to 1.
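As a sketch, softmax can be written as follows. The raw scores are hypothetical, and subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw last-layer values into probabilities that sum to 1."""
    largest = max(scores)
    # Shifting by the largest score avoids overflow in exp().
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of a last layer with three neurons.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```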
(Note: If there are only two classes we can use just one node and use Sigmoid instead of Softmax)
Now that the output is a probability distribution and all the values are between 0 and 1, we can proceed to calculate the error. The error we use here is called Log Loss, or Categorical Cross Entropy. What this function basically does is compare each element of Y_pred with the corresponding element of Y and measure how far apart they are. Then we average this value across the data.
From the classification perspective, when we predict 0 for a particular element but the actual value is 1, or when we predict 1 and the actual value is 0, our model is performing extremely badly and the error should be huge. This is the main concept behind this error function.
Having this kind of error function forces the neural network to understand where it is going wrong much more quickly, and hence allows it to learn from the data much faster.
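Here is a sketch of categorical cross entropy under stated assumptions. It uses the common form that sums -Y * log(Y_pred) over the classes of each row and averages over the rows, and it clips predictions away from exactly 0 and 1 so the logarithm stays finite. The function name, the clipping value and the sample predictions are hypothetical.

```python
import math


def categorical_cross_entropy(y_actual, y_pred, epsilon=1e-12):
    """Average over rows of the sum over classes of -Y * log(Y_pred)."""
    total = 0.0
    for actual_row, pred_row in zip(y_actual, y_pred):
        for a, p in zip(actual_row, pred_row):
            p = min(max(p, epsilon), 1.0 - epsilon)   # keep log() finite
            total -= a * math.log(p)
    return total / len(y_actual)


# One row whose actual class is the first one (one-hot encoded).
y_actual = [[1, 0, 0]]
good_prediction = [[0.9, 0.05, 0.05]]   # high probability on the correct class
bad_prediction = [[0.05, 0.05, 0.9]]    # high probability on the wrong class

good_error = categorical_cross_entropy(y_actual, good_prediction)
bad_error = categorical_cross_entropy(y_actual, bad_prediction)
print(good_error, bad_error)
```

The confidently wrong prediction gets a much larger error than the confidently correct one, which is the behaviour described above.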
This concludes this two part tutorial on Neural Networks. I will be going much deeper into some of the concepts we have covered today in future articles and I will be going through several technical tutorials as well. Thank you for reading!