Zero to Deep Learning Blog

How to effectively measure a classifier’s performance and interpret its metrics

After building and training a machine learning model, we need to evaluate its performance. This is also important when we want to compare several different models and choose the best one. Intuitively speaking, we humans use evaluation metrics to make decisions in our everyday life too. For example, if you are trying to choose a hospital for surgery, you want the hospital with the highest success rate. The same is true for classification models: we use evaluation metrics to compare them and select the best one. Some of these metrics can be summarized in a single number, such as accuracy, while others are more complex.

I am sure you are familiar with the movie Titanic. Today we will answer the question: would you have survived if you had been on board? We will be using the Titanic dataset, which contains information about the passengers on the ship and whether they survived or died; hence it is a binary classification problem. The recorded information includes Age, Sex, and Pclass (ticket class), which were primary factors in the order of evacuation from the ship, along with other fields such as the number of siblings, name, cabin, ticket number, and so on.
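If you want to follow along in code, here is a minimal sketch of loading the data with pandas. The file name `titanic.csv` and the exact column names are assumptions and may differ depending on where you download the dataset.

```python
import pandas as pd

# Load the Titanic data; "titanic.csv" is a placeholder for wherever
# your copy of the dataset lives (column names may also differ).
df = pd.read_csv("titanic.csv")

# Peek at a few rows and the columns mentioned above.
print(df[["Survived", "Pclass", "Sex", "Age", "SibSp"]].head())
```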

5 data points (rows) out of 1309; Survived = 1 means the passenger survived the sinking; Survived = 0 means the passenger died.

Accuracy

To start with, we will use a simple logistic regression model to predict the survival of the passengers on the ship.
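Here is one possible sketch of such a model with scikit-learn, continuing from the loading code above. The feature selection and encoding are simplified assumptions, not necessarily the exact preprocessing behind the numbers quoted below.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Keep a few simple features and drop rows with missing values.
data = df[["Survived", "Pclass", "Sex", "Age"]].dropna().copy()
data["Sex"] = (data["Sex"] == "female").astype(int)  # crude binary encoding

X = data[["Pclass", "Sex", "Age"]]
y = data["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```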

Model's predictions on a random batch of data

Let’s calculate the accuracy score of this model.

We can see that the model has predicted three correct values out of the five shown above, so the accuracy score is

\frac{3}{5} = 0.6

In more general terms, we define the accuracy as the count of correctly classified samples divided by the total number of samples in the dataset. For this specific model, its accuracy over the whole dataset is 0.82 or 82%.
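In code this is a one-liner, both by hand and with scikit-learn (continuing the sketch above; your exact number depends on the preprocessing and split, so it may not match the 0.82 quoted here):

```python
from sklearn.metrics import accuracy_score

# Accuracy = correctly classified samples / total samples.
manual_accuracy = (y_pred == y_test).mean()
sklearn_accuracy = accuracy_score(y_test, y_pred)

print(manual_accuracy, sklearn_accuracy)  # the two values are identical
```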

To further understand our model’s behavior, we need to inspect four metrics related to the model’s predictions:

  1. true positives
  2. true negatives
  3. false positives
  4. false negatives

Every sample predicted by the model in a binary classification must be in one and only one of these categories.

We can see that each concept is composed of two words (True/False) and (Positive/Negative). The first word describes whether the model has correctly classified the sample or not:

  1. True means correctly classified.
  2. False means the sample was misclassified.

The second word describes the predicted class.

  1. Positive means the predicted class was 1 (in our example it means that the model classified this passenger as a survivor).
  2. Negative means the predicted class was 0 (in our example it means that the model predicted that this passenger died on the ship).

In simple terms, a false positive is a passenger who died but was predicted to survive, and a false negative is a passenger who survived but was predicted to die.

Now we can define the accuracy score as a formula where the correctly classified instances are true positives (TP) and true negatives (TN):

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
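With scikit-learn, the four counts can be read directly off the confusion matrix; here is a sketch that continues the example above and recomputes the accuracy from them:

```python
from sklearn.metrics import confusion_matrix

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, tn, fp, fn, accuracy)  # matches accuracy_score(y_test, y_pred)
```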

Although the accuracy score is relatively simple, it is the most commonly used metric for evaluating classifiers. The accuracy score is always between 0 and 1, and the higher it is, the better the model is.

Establishing a baseline

One question arises: given an accuracy score, how can I tell whether my model is good or not? Let’s say we calculated the accuracy and it is 89%. Is this a good number?

If we compare that to the accuracy of a coin flip, which is 50%, we would be inclined to say “Yes”. However, things are not that simple. This is only true if the two classes are perfectly balanced.

Incidentally, that is also true for more than two classes, in which case, the random classifier would have an accuracy of \frac{1}{\text{number of classes}} .
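scikit-learn’s DummyClassifier is a convenient way to build such a random baseline. Below is a minimal sketch on our binary Titanic split; with two classes, uniform random guessing lands around 0.5.

```python
from sklearn.dummy import DummyClassifier

# A "classifier" that guesses uniformly at random among the observed classes.
random_clf = DummyClassifier(strategy="uniform", random_state=0)
random_clf.fit(X_train, y_train)

# Expected accuracy is roughly 1 / number of classes (about 0.5 here).
print(random_clf.score(X_test, y_test))
```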

Unbalanced classes

What if our class distribution looks like this figure:

University Students Distribution

In this case, we see that most of our students are undergrads, a smaller number are masters students, and Ph.D. students are very few. This is a clear case of an unbalanced dataset. In such cases, we compare our model’s accuracy against the accuracy of the majority classifier. The majority classifier classifies everything as the majority class, so in this case it will classify everyone as an undergrad, leading to an accuracy of 84.7%. In more general terms, its accuracy is \frac{N_{\text{majority class}}}{N_{\text{dataset}}} .
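DummyClassifier also covers this case directly; a minimal sketch on the Titanic split, where the majority class is “did not survive”, looks like this:

```python
from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class seen in the training data.
majority_clf = DummyClassifier(strategy="most_frequent")
majority_clf.fit(X_train, y_train)

# This is the baseline accuracy a real model has to beat.
print(majority_clf.score(X_test, y_test))
```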

As we have seen, there are cases where accuracy on its own is not a very good measure of comparison: it always has to be judged relative to either a random classifier or a majority classifier, meaning that you cannot interpret it without additional information about the class balance in the dataset. For this purpose, other metrics are more useful in different situations.

Recall, precision, and specificity

Going back to our Titanic example, imagine a scenario where you want to decide whether to go back in time to ride this ship. Having this dataset, you will build a model and, depending on its prediction, either go or cancel. In this scenario, you don’t care much whether the model is accurate overall. You care mostly that the model correctly predicts every possible death situation, because you cannot risk it. So, in this case, we focus more on the model’s specificity than on its accuracy.

Specificity is the model’s ability to predict the negative class every time it encounters a sample that is actually negative. TN and FP together account for all the samples that are actually negative (the true negatives, which are correctly classified as negative, and the false positives, which are actually negative but incorrectly classified as positive).

\text{Specificity} = \frac{TN}{FP+TN}

The corresponding metric for the positive class is called the recall, which represents the model’s ability to predict the positive class when it is encountered. TP and FN represent all the samples that are actually positive.

\text{Recall} = \frac{TP}{TP+FN}

Another useful metric is precision, which measures how reliable the model’s positive predictions are. Example: if the model predicts that a patient has cancer, what is the probability that the patient truly has cancer? TP and FP together account for all the samples predicted as the positive class.

\text{Precision} = \frac{TP}{TP+FP}
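For the binary Titanic example, all three can be computed in a few lines. scikit-learn has built-in functions for recall and precision but not for specificity, which we derive from the confusion matrix (a sketch continuing the example above):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
specificity = tn / (tn + fp)                 # i.e. recall of the negative class

print(recall, precision, specificity)
```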

It is important to understand that each of these metrics is a single value for binary classification but a vector for multi-class classification, where each class has its own recall, precision, and specificity.
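With scikit-learn, passing `average=None` returns this per-class vector instead of a single number. A small sketch with hypothetical labels (0 = dog, 1 = cat, 2 = bird):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical multi-class labels: 0 = dog, 1 = cat, 2 = bird.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_hat  = [0, 1, 1, 1, 2, 0, 2]

# average=None gives one value per class instead of an aggregate.
print(recall_score(y_true, y_hat, average=None))     # ≈ [0.5, 1.0, 0.667]
print(precision_score(y_true, y_hat, average=None))  # ≈ [0.5, 0.667, 1.0]
```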

As useful as these metrics are, each one can be misleading on its own: a poor model can still score perfectly on a single metric. For example, if we predict everything as negative, our specificity will be 100%.
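A quick sketch makes the point: a predictor that always says “negative” gets perfect specificity on the Titanic split while being useless for the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# A useless "model" that predicts the negative class for every sample.
y_all_negative = np.zeros_like(y_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_all_negative).ravel()

print(tn / (tn + fp))                        # specificity = 1.0, looks perfect
print(recall_score(y_test, y_all_negative))  # recall = 0.0, the model is useless
```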

Confusion matrix

For this reason, we cannot use just one metric for evaluating a model’s performance, and this is where the confusion matrix comes into play.

Multi-class Confusion Matrix

If you find this picture a little bit confusing, that’s OK. So far, we’ve said that there are two classes: one positive and one negative. That’s true in the case of binary classification. In the case of multi-class classification, each class is in turn considered positive and all the others negative. So if we have a dataset with three classes {dog, cat, bird}, we can imagine it as three binary classifications (see the sketch after this list):

  1. {dog/ not dog}
  2. {cat/ not cat}
  3. {bird/ not bird}
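scikit-learn makes this one-vs-rest view explicit with `multilabel_confusion_matrix`, which returns one small binary confusion matrix per class. A sketch with the same hypothetical labels as above:

```python
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical three-class labels: 0 = dog, 1 = cat, 2 = bird.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_hat  = [0, 1, 1, 1, 2, 0, 2]

# One 2x2 matrix per class, each in "class vs. not class" form:
# [[TN, FP],
#  [FN, TP]]
print(multilabel_confusion_matrix(y_true, y_hat))
```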

An ideal confusion matrix has nonzero entries only on the diagonal.

When we have many classes, the confusion matrix becomes hard to read as a raw table of numbers, so a widespread practice is to plot it as a heatmap.

Example of confusion matrix for interpretation

In this confusion matrix, we can see that some hubs are being mistaken for cameras.
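Here is a minimal sketch of how one could produce such a plot for our own Titanic model, assuming scikit-learn 1.0 or later (the figure above comes from a different, multi-class dataset):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Draw the confusion matrix as a heatmap; mistakes show up off the diagonal.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
```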

Debugging Keras and TensorFlow models can be challenging because you must have both the background needed to understand what is wrong and a set of technical skills that will allow you to fix it.

Instead of spending 80% of your time fixing bugs or reading how-to articles, join us at the Zero to Deep Learning Bootcamp. Quickly learn the essentials of deep learning, both theory and practice, allowing you to get things right from the start and save your time and energy.

Francesco
