How and Why I Switched from the ROC Curve to the Precision-Recall Curve to Analyze My Imbalanced Models: A Deep Dive

Juan Esteban de la Calle
5 min read · Jul 10, 2023

Machine learning is transforming how we tackle problems and make decisions in an array of sectors, from healthcare to finance. Binary classification is a particularly common task in this field, where we must decide whether an input belongs to one of two distinct classes. To gauge the performance of such models, we often use metrics like the Receiver Operating Characteristic (ROC) curve. However, the Precision-Recall (PR) curve, in certain contexts, can offer a more realistic and insightful view. Let me tell you the story of why and how I made the switch.

Starting Point: The ROC Curve and the AUC

The ROC curve is a fundamental tool for evaluating binary classification models. It displays the relationship between the true positive rate (TPR) and the false positive rate (FPR) as we vary the discrimination threshold. In simpler terms, it helps us understand how our model performs at different levels of certainty.

The Area Under the Curve (AUC) serves as a handy summary statistic of the ROC curve. It equals the probability that a randomly selected positive instance is ranked higher than a randomly selected negative one. If the AUC equals 1, we have a perfect model. If the AUC equals 0.5, our model is no better than a coin flip.
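To make that ranking interpretation concrete, here is a minimal sketch (using synthetic scores rather than a real model) that compares sklearn's roc_auc_score with the fraction of positive-negative pairs in which the positive instance scores higher:

import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and scores; positives are shifted upward so they tend to rank higher
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = y_true * 0.5 + rng.normal(size=1000)

# AUC as reported by sklearn
auc_sklearn = roc_auc_score(y_true, y_score)

# AUC as the probability that a random positive outranks a random negative
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"sklearn AUC: {auc_sklearn:.4f}")
print(f"pairwise ranking estimate: {pairwise:.4f}")

With continuous scores (no ties), the two numbers agree exactly.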

The ROC curve and AUC have won over many practitioners because they are, in general, insensitive to class imbalance, where the number of instances in one class far outweighs the other. This makes them seem like a safe, universally applicable option.

Moreover, many analytical tools offer the AUC-ROC metric as the default for hyperparameter search, leaving AUCPR available only as a custom scoring function.

My Doubts: Recognizing the Limitations of the ROC Curve

However, the reality is more nuanced. What I realized is that the perceived strength of the ROC curve can also be its Achilles’ heel. In situations where the dataset is highly imbalanced, the ROC curve can give an overly optimistic assessment of the model’s performance.

This optimism bias arises because the ROC curve’s false positive rate (FPR) can become very small when the number of actual negatives is large. As a result, even a large number of false positives produces only a small FPR, which can yield a deceptively high AUC that doesn’t reflect the practical reality of using the model.
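A quick back-of-the-envelope example (with hypothetical counts) shows how a large negative class swallows the false positives in the FPR denominator, while precision still exposes them:

# Hypothetical confusion-matrix counts on a heavily imbalanced test set
tp, fn = 270, 30        # 300 actual positives
fp, tn = 900, 96_100    # 97,000 actual negatives

fpr = fp / (fp + tn)             # the x-axis of the ROC curve
precision = tp / (tp + fp)       # the y-axis of the PR curve

print(f"FPR: {fpr:.4f}")               # ~0.009 -> looks excellent on the ROC axis
print(f"Precision: {precision:.4f}")   # ~0.231 -> roughly 3 of every 4 alerts are false alarms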

My Choice: The Precision-Recall Curve

Enter the Precision-Recall (PR) curve. This curve plots Precision (the proportion of true positives among positive predictions) against Recall (the proportion of true positives identified out of all actual positives). This approach focuses exclusively on the performance within the positive class.
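As a quick reference, this small sketch (with made-up labels and predictions) spells out both definitions using sklearn's built-in scorers:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and hard predictions for ten cases
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])

# Precision: of the 3 predicted positives, 2 are correct -> 2/3
print(precision_score(y_true, y_pred))
# Recall: of the 4 actual positives, 2 are caught -> 2/4
print(recall_score(y_true, y_pred))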

What this means is that a large number of negative instances won’t skew our understanding of how well our model performs on the positive class. In other words, the PR curve offers a more transparent view of a model’s performance on imbalanced datasets.

The PR curve’s focus on the positive class also aligns more closely with the business objectives in many real-world scenarios. For example, in fraud detection, the priority isn’t correctly classifying non-fraudulent transactions (the negative class). Instead, we’re most interested in catching as many fraudulent transactions (the positive class) as possible.

But the precision-recall curve isn’t just about providing a more realistic performance assessment. The area under the PR curve (AUCPR) can also be a more informative summary statistic than the traditional AUC.

The AUCPR ranges from 0 to 1, with 1 indicating a perfect classifier. However, unlike the AUC of the ROC curve, the AUCPR of a random classifier is equal to the proportion of positive instances in the dataset. This makes it a more conservative (and in many cases, more realistic) metric when dealing with imbalanced data.
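This baseline behaviour is easy to check with purely random scores; the sketch below simply verifies that the ROC AUC hovers around 0.5 while the AUCPR (average precision) hovers around the positive prevalence:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n, prevalence = 100_000, 0.03

# Labels with ~3% positives and scores that carry no information at all
y_true = (rng.random(n) < prevalence).astype(int)
y_score = rng.random(n)

print(f"ROC AUC of random scores: {roc_auc_score(y_true, y_score):.3f}")            # ~0.50
print(f"AUCPR of random scores:   {average_precision_score(y_true, y_score):.3f}")  # ~0.03
print(f"Positive prevalence:      {y_true.mean():.3f}")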

Illustrating the Difference: ROC vs. PR with Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

# Create an imbalanced dataset
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2,
                           weights=[0.97, 0.03], random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get the probability scores for the testing set
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

# Calculate the Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_score)
pr_auc = average_precision_score(y_test, y_score)

# Plot the ROC curve
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")

# Plot the Precision-Recall curve
plt.subplot(1,2,2)
plt.plot(recall, precision, color='darkorange', lw=2, label='PR curve (AUC = %0.2f)' % pr_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()
Figure: PR curve vs. ROC curve

In this code, we create an imbalanced dataset of 10,000 samples in which only 3% belong to the positive class. We then train a random forest classifier and evaluate its performance using both the ROC and PR curves.

Moving Forward: Making the Switch to Precision-Recall

The PR curve’s robustness in the face of class imbalance is why I made the switch from ROC to PR in certain scenarios. It allows me to better assess and fine-tune my models, especially those with highly imbalanced datasets. The PR curve brings me closer to my ultimate goal in model development: not just to achieve high performance in the abstract, but to create models that are effective and reliable in the real world.

However, it’s important to remember that no single tool is a silver bullet. Different tasks may call for different performance measures. Just as the ROC curve is more suitable for balanced datasets, the PR curve is more suitable for imbalanced ones.

Conclusion

Recognizing the ROC curve’s limitations in the face of imbalanced datasets was a significant turning point in my journey as a data scientist. It prompted me to explore alternatives, leading me to the PR curve and its power in handling such datasets.

The story of my switch from the ROC curve to the PR curve is a reminder of the importance of continually questioning and reevaluating the tools we use. It’s crucial to remember that metrics are just tools, and like all tools, they work best when suited to the task at hand. The field of machine learning is vast, and we must always stay open to exploring and embracing new approaches to old problems.
