Zero to Deep Learning Blog

How to choose the correct loss function for your neural network

Neural networks learn from their mistakes, much like (most) humans do, only in a simpler way.

Unlike children, who learn from their parents’ response, neural nets require a more formal and mathematical definition of a mistake. They need a loss function, also called error, cost function, or objective function.

The common framework for all neural networks and many machine learning techniques is as follows:

  1. Define a model that depends on parameters
  2. Assign a goal, defined as an objective function
  3. Find the best parameters that optimize this function

This is how supervised learning works. The supervision comes from using a loss function to compare the model predictions with the expected answers.

Usually, we formulate the optimization problem as a minimization rather than a maximization. This is why the objective function got the name loss or error.

In the simple case where you have just one parameter, the loss function looks like the parabola figure below (convex loss function against a parameter value). In the case of neural networks, you can have millions or even billions of parameters, and the loss is a function of all of them, so it lives in a vast multi-dimensional space. Still, the principle is the same: in both cases, we aim to find the parameter values that minimize the loss.
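To make the one-parameter picture concrete, here is a minimal sketch of gradient descent on a parabola. The loss function, the starting value, and the learning rate are all made up for illustration; real networks apply the same idea to many parameters at once.

```python
# Hypothetical one-parameter loss: a parabola with its minimum at w = 3.
def loss(w):
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0              # arbitrary starting value (assumption)
learning_rate = 0.1  # assumed step size

for step in range(100):
    # Move the parameter a small step against the slope of the loss.
    w -= learning_rate * grad(w)

print(w, loss(w))
```

Each step shrinks the distance to the minimum, so `w` ends up very close to 3 and the loss very close to 0.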

Choosing a good loss function definition plays a pivotal role in the quality of the final model.

For every task in deep learning, there are several loss functions which can result in different models. In the next paragraphs, we will go through the most common loss functions.

Loss functions for regression problems

Let’s start by looking at loss functions that work well for regression problems. Regression tasks require the neural network to output one or more real numbers such as age, income, location of a face in the image, the pixels of an image, etc.

One way to define the error for one sample is as follows:

y_{true} - y_{pred}

which is the difference between true and predicted value. For example: my true age is 37 while the model used my picture and predicted my age to be 31. In this case, the model error is: 37 – 31 = 6 . Similarly, if the model predicted 37, then it wouldn’t have made an error, and the loss value would have been 0. 

As for most loss functions, it is good practice to combine the value of the loss over multiple samples by taking the average:

\frac{1}{N} \sum_{\text{samples}} (y_{true} - y_{pred})

where N is the number of samples.

However, the average loss defined above has a serious flaw: per-sample errors can be negative, so they can cancel out the positive ones. Let’s examine this in detail in the following example:

Let’s calculate the loss as we defined it previously, using the values in the table:

\text{Loss} = \frac{(1-0) +(2-2) +(3-5) +(4-3) +(5-9) +(6-2) +(7-7)}{7} = \frac{1+0-2+1-4+4+0}{7} = 0

We can see that the negative sample losses cancel out the positive sample losses and we end up with a loss value of 0 for a model that performs badly.
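The same cancellation can be reproduced in a few lines of NumPy. The true and predicted values below are read off the worked equation above; the variable names are just for illustration.

```python
import numpy as np

# Values taken from the worked example above.
y_true = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y_pred = np.array([0, 2, 5, 3, 9, 2, 7], dtype=float)

errors = y_true - y_pred            # signed per-sample errors
mean_signed_error = errors.mean()   # positive and negative errors cancel

print(errors)
print(mean_signed_error)
```

Only two of the seven predictions are exactly right, yet the signed average reports no error at all.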

Hence we need to improve our loss function by dealing with negative values.

Example: if the true value was 5, and the model predicted 3 or 7, the error should be the same because the prediction is two steps away from the actual value in both cases.

There are two common methods for dealing with the negative values, which are:

  1. Absolute value: taking the absolute value of the per-sample loss.

|y_{true} - y_{pred}|

  2. Square (power of 2): taking the square of the per-sample loss.

(y_{true} - y_{pred})^2

Correspondingly, these are two of the most common regression loss functions:

  1. Mean absolute error (MAE):

\frac{1}{N} \sum_{\text{samples}} |y_{true} - y_{pred}|

  2. Mean squared error (MSE):

\frac{1}{N} \sum_{\text{samples}} (y_{true} - y_{pred})^2

Using the example above, we get:

\text{MAE} = \frac{|1-0| +|2-2| + ... +|7-7|}{7} = \frac{1+0+2+1+4+4+0}{7} \approx 1.71

\text{MSE} = \frac{(1-0)^2 +(2-2)^2 + ... +(7-7)^2}{7} = \frac{1+0+4+1+16+16+0}{7} \approx 5.43
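As a quick check, both losses can be computed with NumPy. This is a minimal sketch; the function names are chosen here for illustration and are not tied to any particular library.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    # Average of the absolute per-sample errors.
    return np.mean(np.abs(y_true - y_pred))


def mean_squared_error(y_true, y_pred):
    # Average of the squared per-sample errors.
    return np.mean((y_true - y_pred) ** 2)


# Same values as in the worked example.
y_true = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y_pred = np.array([0, 2, 5, 3, 9, 2, 7], dtype=float)

mae = mean_absolute_error(y_true, y_pred)  # 12 / 7
mse = mean_squared_error(y_true, y_pred)   # 38 / 7

print(round(mae, 2), round(mse, 2))
```

Unlike the signed average, neither value can be pulled back to zero by errors of opposite sign.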

MAE and MSE are similar in that they both range between 0 and +∞, but they lead to different models because of the following differences:

  1. MSE penalizes large errors more than MAE.
  2. MSE has a continuous derivative everywhere, while the derivative of MAE is discontinuous at 0, where the error changes sign.
  3. MAE involves simpler calculations and yields faster training.
  4. MAE penalizes small errors (under 1) more than MSE.

As a rule of thumb, make MSE your default loss function for regression problems, but consider MAE in cases where robustness to outliers is needed.
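One way to see the difference in outlier sensitivity is to ask which single constant prediction minimizes each loss. The sketch below uses made-up targets with one outlier and a brute-force search over candidate values (an assumption chosen for simplicity, not how networks are trained). The MSE-optimal constant lands near the mean, which the outlier drags upward, while the MAE-optimal constant lands near the median.

```python
import numpy as np

# Hypothetical targets: six typical values and one outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0])

# Candidate constant predictions on a grid with step 0.01.
candidates = np.linspace(0.0, 100.0, 10001)

# Loss of each candidate against all targets (rows: candidates, columns: samples).
diff = y[None, :] - candidates[:, None]
mse = (diff ** 2).mean(axis=1)
mae = np.abs(diff).mean(axis=1)

best_mse = candidates[mse.argmin()]  # close to the mean of y
best_mae = candidates[mae.argmin()]  # close to the median of y

print(best_mse, y.mean())
print(best_mae, np.median(y))
```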

Even though MAE and MSE are the most common loss functions for regression tasks, there exist many other losses that are useful in domain-specific problems.

Loss functions for classification problems

Now we move to the other side of the spectrum for supervised learning: classification. This is the task of assigning a sample to one or more classes (categories). Classification tasks with neural networks don’t just give you a label but also a probability associated with it.

Depending on the number of classes, we typically talk about binary classifications and multi-class classifications:

  1. Binary classification: our target label is a boolean, e.g.: spam/not_spam, positive_review/negative_review, fraud/not_fraud.
  2. Multi-class classification: our target label is a set of distinct categories, like cat/dog/horse/mouse. Multi-class classification itself can be of two kinds:
    1. Mutually-exclusive classes: our sample can belong to only one category at a time. This is the most common scenario. Example: car brand detection with classes being Ford/Toyota/Mercedes/Audi etc. 
    2. Non-mutually-exclusive classes (often called multi-label classification): our sample can belong to multiple categories at the same time, for example, husky/dog/cat/horse/animal/human. In this case, we can have a picture of a husky, which is also a dog and an animal. 

Now that we know the different types of classification tasks, the question remains: how can we implement each one of them?

For implementation, you need to keep track of three elements:

  1. The size of the output
  2. The output activation function
  3. The output loss

Starting from binary classification, our output is only one value, which is the probability of belonging to one class (the positive class). Since it is a probability, it must:

  1. always be positive
  2. be less than or equal to 1

These two constraints are satisfied by using the sigmoid activation function after the final output of the network, which maps any value into the (0, 1) range, as shown in the picture:

A small question to test your understanding: given y_{pred}, which is the probability that a sample belongs to the positive class, what is the probability that the same sample belongs to the negative class?

The answer is simple: 1-y_{pred} . That’s because of another one of the constraints of probabilities: the sum of all possible disjoint probabilities is equal to 1.
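Here is a minimal sketch of the sigmoid and of the complementary probability. The raw network outputs (logits) below are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical raw outputs of the last layer, before the activation.
logits = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

p_positive = sigmoid(logits)     # probability of the positive class
p_negative = 1.0 - p_positive    # probability of the negative class

print(p_positive)
print(p_negative)
```

A raw output of 0 maps to a probability of 0.5, and the two probabilities always add up to 1.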

Binary classification

Now for the loss function. Our loss function of choice is binary cross entropy. The main goal of this loss is to make our predictions as close to the true labels as possible, using the following formula:

-1*(y_{true}*\log(y_{pred}) + (1-y_{true})*\log(1-y_{pred}))

y_{true} has values either 1 or 0, leading to the loss for each sample being either -\log(y_{pred}) for the positive class or -\log (1-y_{pred}) for the negative class. 

The intuition behind this loss is the following: since y_{pred} has values in the interval (0, 1) and since reducing -\log(y_{pred}) is equivalent to increasing \log(y_{pred}), we are pushing our y_{pred} to be as close as possible to 1 (for which the logarithm would be exactly 0) for the positive class. Vice versa for the negative class: minimizing - \log(1-y_{pred}) will push y_{pred} towards 0.
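The formula translates almost directly into NumPy. In this sketch the labels and predictions are invented, and the small `eps` clipping is an added assumption to avoid taking the logarithm of exactly 0.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions so that log(0) never occurs (eps is an assumption).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred)
                   + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_sample.mean()


y_true = np.array([1.0, 0.0, 1.0, 0.0])

# Hypothetical predictions: one model is mostly right, the other mostly wrong.
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_good = binary_cross_entropy(y_true, mostly_right)
loss_bad = binary_cross_entropy(y_true, mostly_wrong)

print(loss_good, loss_bad)
```

Predictions close to the true labels give a small loss, while confident mistakes are penalized heavily.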

Mutually-exclusive multi-class classification

Now for the multi-class classification setting, specifically the mutually exclusive case.
It is not very different from the binary classification case, except now we have multiple classes, meaning we need more than one output for the network. We need precisely as many as the classes that we have. Say our classes were Animal/Human/Desk/Table: we would have four classes, meaning we would need four output neurons in the last layer.
For the output activation function, we would use softmax, which does almost what sigmoid does but makes sure that the sum of the outputs of all the neurons is exactly 1 (i.e., that they are all considered probabilities from the same multinomial distribution).
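A minimal softmax sketch for the four-class example follows. The logits are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np


def softmax(logits):
    # Subtract the max for numerical stability; the output is unchanged.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)


classes = ["Animal", "Human", "Desk", "Table"]
logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical raw outputs

probs = softmax(logits)

print(dict(zip(classes, probs.round(3))))
print(probs.sum())
```

Every output is positive, the largest logit gets the largest probability, and the four probabilities sum to 1.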

Finally, we need a loss function. For this, we have two options depending on the type of encoding we are using for our target classes.

  1. One-hot-encoded labels (e.g., [0, 0, 0, 1] to indicate that the true label for this data point is “Table”, our fourth class). In this case, we use categorical cross-entropy loss.
  2. Label-encoded labels (e.g., [3] to indicate “Table”). For these, we can use the sparse categorical cross-entropy loss. It computes the same quantity as categorical cross-entropy but takes integer class indices directly, so the labels don’t need to be one-hot encoded.

In either case, the cross-entropy loss is computed as:

- \sum_{\text{classes}} y_{true}*\log(y_{pred})

that is, the negative sum of y_{true}*\log(y_{pred}) over all the output classes, averaged over a sample of data points.
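The sketch below shows that the two label encodings lead to the same loss value. The function names mirror the loss names used above but are written from scratch here, the predictions are made up, and the `eps` clipping is an added assumption to avoid log(0).

```python
import numpy as np


def categorical_cross_entropy(y_true_one_hot, y_pred, eps=1e-7):
    # Sum over classes, then average over samples.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -(y_true_one_hot * np.log(y_pred)).sum(axis=1).mean()


def sparse_categorical_cross_entropy(y_true_idx, y_pred, eps=1e-7):
    # Pick the predicted probability of the true class for each sample.
    y_pred = np.clip(y_pred, eps, 1.0)
    picked = y_pred[np.arange(len(y_true_idx)), y_true_idx]
    return -np.log(picked).mean()


# Hypothetical softmax outputs for two samples and four classes.
y_pred = np.array([[0.1, 0.1, 0.1, 0.7],
                   [0.6, 0.2, 0.1, 0.1]])

labels = np.array([3, 0])    # label-encoded targets
one_hot = np.eye(4)[labels]  # the same targets, one-hot encoded

loss_one_hot = categorical_cross_entropy(one_hot, y_pred)
loss_sparse = sparse_categorical_cross_entropy(labels, y_pred)

print(loss_one_hot, loss_sparse)
```

Because the one-hot vector is zero everywhere except at the true class, only the predicted probability of the true class contributes to the sum.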

Non-exclusive multi-class classification

The last type of classification problem we will describe is the non-mutually-exclusive case. It is very similar to the previous one, except that every class is treated as a binary classification problem on its own, so the output probabilities do not need to sum to 1.
That’s why we use binary cross-entropy for every output neuron, with a sigmoid activation function.
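As a minimal sketch, here is the husky example with one sigmoid output per class. The logits are invented, and the 0.5 decision threshold is an assumption rather than a fixed rule.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


classes = ["husky", "dog", "cat", "horse", "animal", "human"]

# Hypothetical raw outputs for a picture of a husky.
logits = np.array([3.0, 4.0, -3.0, -4.0, 5.0, -5.0])
y_true = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])

# Each output is an independent probability; they need not sum to 1.
probs = sigmoid(logits)

# Binary cross-entropy computed separately for every class, then averaged.
per_class = -(y_true * np.log(probs) + (1.0 - y_true) * np.log(1.0 - probs))
loss = per_class.mean()

# Assumed decision rule: a class is predicted when its probability exceeds 0.5.
predicted = [c for c, p in zip(classes, probs) if p > 0.5]

print(predicted, probs.sum(), loss)
```

Several classes can be active at once, which is exactly what a softmax output would not allow.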

Other losses for special cases

Of course, machine learning and deep learning aren’t only about classification and regression, although they are the most common applications. 

Other loss functions include:

  1. Pitting loss or adversarial loss: used mainly in generative adversarial networks (GANs), which are widely used for image generation.
  2. Cosine distance: used in outlier detection and in classification, especially in the case where the number of classes is virtually unlimited.
  3. Triplet losses: these combine several samples at once instead of one sample, usually used in facial recognition tasks.
  4. Structure regularizers: this is a special type of loss that takes a graph representing the data in addition to the sample itself, commonly used for neural structured learning tasks, which are an improvement over simple classification. 
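To give a flavor of two of these, here is a rough sketch of the cosine distance and of a triplet loss built on top of it. The vectors, the margin value, and the choice of cosine distance inside the triplet loss are all assumptions for illustration; real systems compare learned embeddings and may use other distances.

```python
import numpy as np


def cosine_distance(a, b):
    # 0 when the vectors point the same way, 1 when they are orthogonal.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    # The anchor should be closer to the positive than to the negative
    # by at least `margin` (an assumed value); otherwise we pay a penalty.
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(gap + margin, 0.0)


# Hypothetical 2-D embeddings.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # same identity as the anchor
negative = np.array([0.0, 1.0])   # different identity

well_separated = triplet_loss(anchor, positive, negative)
swapped = triplet_loss(anchor, negative, positive)

print(well_separated, swapped)
```

When the positive is already much closer than the negative, the loss is zero; swapping the two roles produces a large penalty.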

Of course, knowing the losses is very useful, but without a solid foundation and the right background, this knowledge can be confusing and overwhelming. Instead of spending 80% of your time jumping from one StackOverflow link to the other, join us at the Zero to Deep Learning Bootcamp. You will quickly learn the essentials of deep learning, both theory and practice, allowing you to get the right fundamentals needed for every deep learning engineer.

