Loss Functions Explained
In any deep learning project, choosing the loss function is one of the most important steps in making sure the model behaves as intended. The loss function gives your neural network a lot of practical flexibility, because it defines how the network’s output is compared with the target and therefore what the network is trained to do.
There are several tasks neural networks can perform, from predicting continuous values like monthly expenditure to classifying discrete classes like cats and dogs. Each different task would require a different type of loss since the output format will be different. For very specialized tasks, it’s up to us how we want to define the loss.
From a very simplified perspective, the loss function (J) can be defined as a function which takes in two parameters:
- Predicted Output (Y_pred)
- True Output (Y)
This function essentially measures how poorly our model is performing by comparing what the model predicts with the actual value it is supposed to output. If Y_pred is very far from Y, the loss will be very high. If the two values are nearly the same, the loss will be very low. Hence we need a loss function that penalizes the model effectively while it trains on a dataset.
If the loss is very high, this large value propagates back through the network during training and the weights are changed more than usual. If it is small, the weights change only slightly, since the network is already doing a good job.
This scenario is somewhat analogous to studying for exams. If one does poorly in an exam, we can say the loss is very high, and that person will have to change a lot of things within themselves in order to get a better grade next time. However if the exam went well, then they wouldn’t do anything very different from what they are already doing for the next exam.
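The link between the size of the loss and the size of the weight change can be sketched with a toy model. This is only an illustration under stated assumptions: a single-weight model y_pred = w * x, a squared-error loss, plain gradient descent, and made-up numbers. The names `squared_error` and `weight_update` are hypothetical helpers, not part of any library.

```python
def squared_error(y_pred, y_true):
    """Loss for a single example: squared difference (assumed for illustration)."""
    return (y_pred - y_true) ** 2


def weight_update(w, x, y_true, learning_rate=0.1):
    """One gradient-descent step for the toy model y_pred = w * x."""
    y_pred = w * x
    gradient = 2 * (y_pred - y_true) * x  # derivative of the squared error w.r.t. w
    return w - learning_rate * gradient


w = 1.0
x = 1.0

# Prediction far from the target: large loss, large weight change.
far = weight_update(w, x, y_true=5.0)
# Prediction close to the target: small loss, small weight change.
near = weight_update(w, x, y_true=1.1)

print("loss (far): ", squared_error(w * x, 5.0), " weight change:", abs(far - w))
print("loss (near):", squared_error(w * x, 1.1), " weight change:", abs(near - w))
```

In this sketch the update is proportional to the prediction error, so the example with the larger loss moves the weight further.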
Now let’s look at classification as a task and understand how the loss functions work in this case.
When a neural network is trying to predict a discrete value, we can consider it to be a classification model. This could be a network trying to predict what kind of animal is present in an image, or whether an email is spam or not. First let’s look at how the output is represented for a classification neural network.
The number of nodes of the output layer will depend on the number of classes present in the data. Each node will represent a single class. The value of each output node essentially represents the probability of that class being the correct class.
Pr(Class 1) = Probability of Class 1 being the correct class
Once we get the probabilities of all the different classes, we will consider the class having the highest probability to be the predicted class for that instance. First let’s explore how binary classification is done.
In binary classification, there is only one node in the output layer, even though we are predicting between two classes. To get the output as a probability, we need to apply an activation function. Since a probability must lie between 0 and 1, we use the sigmoid function, which squashes any real value into a value between 0 and 1.
As the input to the sigmoid becomes larger and tends to plus infinity, the output of the sigmoid will tend to 1. And as the input becomes smaller and tends to negative infinity, the output will tend to 0. Now we are guaranteed to always get a value between 0 and 1, which is exactly how we need it to be since we require probabilities.
If the output is above 0.5 (50% probability), we consider it to fall under the positive class, and if it is below 0.5 we consider it to fall under the negative class. For example, if we are training a network to classify between cats and dogs, we can assign dogs to the positive class, so the output value in the dataset for dogs will be 1. Similarly, cats will be assigned to the negative class, and the output value for cats will be 0.
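The sigmoid and the 0.5 threshold described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `predict_class` is a hypothetical helper, and treating an output of exactly 0.5 as the negative class is an assumption, since the text does not say which way that case goes.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    # Two branches so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(z, threshold=0.5):
    """Return 1 (positive class, e.g. dog) or 0 (negative class, e.g. cat)."""
    # Assumption: an output exactly at the threshold counts as negative.
    return 1 if sigmoid(z) > threshold else 0


for raw_output in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw_output, sigmoid(raw_output), predict_class(raw_output))
```

Large positive inputs give outputs close to 1, and large negative inputs give outputs close to 0, matching the behaviour described above.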
The loss function we use for binary classification is called binary cross entropy (BCE). It effectively penalizes the neural network on a binary classification task.
The loss consists of two separate functions, one for each value of Y. When we need to predict the positive class (Y = 1), we will use
Loss = -log(Y_pred)
And when we need to predict the negative class (Y = 0), we will use
Loss = -log(1-Y_pred)
For the first function, when Y_pred is equal to 1, the loss is 0, which makes sense because Y_pred is exactly the same as Y. As Y_pred gets closer to 0, the loss increases very rapidly, and as Y_pred approaches 0 the loss tends to infinity. This is because, from a classification perspective, 0 and 1 are polar opposites: they represent completely different classes. So when Y_pred is 0 while Y is 1, the loss has to be very high for the network to learn from its mistakes effectively. The second function mirrors this: the loss is 0 when Y_pred is 0 and tends to infinity as Y_pred approaches 1.
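We can combine both cases into a single equation:
Loss = -(Y * log(Y_pred) + (1 - Y) * log(1 - Y_pred))
When Y = 1 the second term vanishes, and when Y = 0 the first term vanishes, so this reduces to the two cases above.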
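As a minimal sketch, the combined form -(Y * log(Y_pred) + (1 - Y) * log(1 - Y_pred)) can be written in plain Python as follows. The small `eps` clipping value is an assumption added here to avoid taking log(0); it is not part of the definition, and the function name is only illustrative.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross entropy for one example.

    y_true is 0 or 1, y_pred is the sigmoid output of the network.
    """
    # Clip the prediction so log() never receives exactly 0 (assumed safeguard).
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


# Y = 1: the loss grows as the prediction moves away from 1.
for p in (0.9, 0.5, 0.1):
    print("Y=1, Y_pred=%.1f -> loss %.4f" % (p, binary_cross_entropy(1, p)))

# Y = 0: the loss grows as the prediction moves away from 0.
for p in (0.1, 0.5, 0.9):
    print("Y=0, Y_pred=%.1f -> loss %.4f" % (p, binary_cross_entropy(0, p)))
```

For a whole dataset, this per-example value is usually averaged over all examples.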
This loss function is also called log loss. This is how the loss function is designed for a binary classification neural network. Now let’s move on to see how the loss is defined for a multiclass classification network.
Multiclass classification is appropriate when there are more than two classes and the model must predict exactly one of them for each input. Since we are still dealing with probabilities, it might seem sensible to apply sigmoid to every output node so that each output lies between 0 and 1, but there is an issue with this. When the classes are mutually exclusive, the individual probabilities must sum to one. Applying sigmoid to each node independently does not guarantee this, so we need a different activation function.
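The activation function we use in this case is softmax. This function ensures that every output node has a value between 0 and 1 and that the output values always sum to 1. The formula for softmax is as follows, where z_i is the raw value of output node i and k is the number of classes:
softmax(z_i) = exp(z_i) / (exp(z_1) + exp(z_2) + ... + exp(z_k))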
In other words, we pass every value through an exponential function. After that, to make sure the results are all in the range 0 to 1 and sum to 1, we divide each exponential by the sum of all the exponentials.
So why do we pass each value through an exponential before normalizing? Why can’t we just normalize the values themselves? One reason is that the raw outputs can be negative or zero, so dividing them by their sum would not give valid probabilities, whereas the exponential of any real number is positive. The exponential also amplifies differences between the values, which pushes the largest output towards 1 and the others towards 0. We then normalize because we need probabilities.
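The exponentiate-then-normalize steps can be sketched in plain Python. The input values below are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick that is assumed here; it does not change the result.

```python
import math


def softmax(values):
    """Turn a list of raw output values into probabilities that sum to 1."""
    # Subtracting the max avoids overflow in exp() and leaves the result unchanged.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


raw_outputs = [2.0, 1.0, -1.0]  # hypothetical raw values of three output nodes
probs = softmax(raw_outputs)

# The predicted class is the node with the highest probability.
predicted = probs.index(max(probs))

print(probs, sum(probs), predicted)
```

Normalizing the same raw values directly would give 2/2, 1/2 and -1/2, which includes a negative number and so cannot be read as probabilities. That is one reason the exponential is applied first.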
Now that our outputs are in a proper format, let’s look at how we configure the loss function. The good thing is that the loss is essentially the same as in binary classification. We apply the -Y * log(Y_pred) term to each output node with respect to its target value and sum the result across all k output nodes:
Loss = -(Y_1 * log(Y_pred_1) + Y_2 * log(Y_pred_2) + ... + Y_k * log(Y_pred_k))
Because the target has a 1 for the correct class and a 0 for every other class, only the term for the correct class contributes to the sum.
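A minimal sketch of this multiclass loss in plain Python is shown below. It assumes the target is one-hot encoded (a 1 for the correct class and 0 elsewhere) and that the predictions are softmax outputs. The `eps` guard against log(0) and the example numbers are assumptions for illustration.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum of -Y_i * log(Y_pred_i) over all output nodes."""
    # max(p, eps) keeps log() away from 0 (assumed safeguard).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


target = [0, 1, 0]                # the second class is the correct one
good_prediction = [0.1, 0.8, 0.1]
poor_prediction = [0.6, 0.2, 0.2]

print(categorical_cross_entropy(target, good_prediction))
print(categorical_cross_entropy(target, poor_prediction))
```

With a one-hot target, the result is just the negative log of the probability assigned to the correct class, so the poorer prediction gets the larger loss.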
This loss is called categorical cross entropy. Now let’s move on to a special case of classification called multilabel classification.
Multilabel classification is done when your model needs to predict multiple classes as the output. For example, let’s say you are training a neural network to predict the ingredients present in a picture of some food. There will be multiple ingredients we need to predict so there will be multiple 1’s in Y.
For this we can’t use softmax, because softmax makes the outputs sum to 1, so the classes compete and a high probability for one class forces the others down. Instead we simply keep sigmoid on every output node, since we are trying to predict each class’s probability independently.
As for the loss we can directly use log loss on each node and sum it, similar to what we did in multiclass classification.
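The multilabel setup, sigmoid on every node followed by log loss summed over the nodes, can be sketched as follows. The raw output values and the three-ingredient target are hypothetical, and the `eps` clipping is an assumed safeguard against log(0).

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_loss(y_true, raw_outputs, eps=1e-12):
    """Apply sigmoid to each node, then sum the log loss over all nodes."""
    total = 0.0
    for t, z in zip(y_true, raw_outputs):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total


# Hypothetical target: ingredients 1 and 3 are present, ingredient 2 is not.
target = [1, 0, 1]
good = multilabel_loss(target, [3.0, -3.0, 3.0])
bad = multilabel_loss(target, [-3.0, 3.0, -3.0])

print(good, bad)
```

Each node is scored on its own, so several classes can have a probability close to 1 at the same time.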
Now that we have covered classification, let’s move on to regression.
In regression, our model is trying to predict a continuous value. Some examples of regression models are:
- House price prediction
- Person Age prediction
In regression models, our neural network will have one output node for every continuous value we are trying to predict. Regression losses are calculated by performing direct comparisons between the output value and the true value.
The most popular loss function for regression models is the mean squared error (MSE). We simply calculate the square of the difference between Y and Y_pred and average this over all the data. Suppose there are n data points:
MSE = (1/n) * ((Y_1 - Y_pred_1)^2 + (Y_2 - Y_pred_2)^2 + ... + (Y_n - Y_pred_n)^2)
Here Y_i and Y_pred_i refer to the i-th Y value in the dataset and the corresponding Y_pred from the neural network for the same data point.
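A minimal Python sketch of the mean squared error over n data points is shown below. The numbers are made up for illustration, and the function name is only a suggestion.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between Y_i and Y_pred_i."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


y = [1.0, 2.0, 3.0]        # hypothetical true values
y_pred = [2.0, 2.0, 5.0]   # hypothetical network outputs

# Squared differences are 1, 0 and 4, so the mean is 5/3.
print(mean_squared_error(y, y_pred))
```

Because the differences are squared, a single large error contributes much more to the loss than several small ones.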
That concludes this article. Hopefully now you have a deeper understanding of how loss functions are configured for various tasks in deep learning. Thank you for reading!