I Declare Myself the #1 Enemy of Over/Undersampling, SMOTE and ADASYN, Here’s Why & How I Circumvent It

Juan Esteban de la Calle
5 min read · Jul 31, 2023


Introduction

In machine learning, imbalanced data sets are a common and challenging problem. As data scientists, we often find ourselves in scenarios where we’re trying to build models with data where one class greatly outnumbers the other. A classic solution to this issue has been the use of over/undersampling techniques. These methods balance the data by either increasing the instances of the minority class (oversampling) or decreasing the instances of the majority class (undersampling). But, after many projects and ample experience, I have come to the conclusion that these techniques are not the best way to deal with imbalanced datasets. Indeed, I have declared myself the #1 enemy of over/undersampling techniques, and here’s why.

(Figure: how SMOTE alters the underlying distribution of the data)

Over/Undersampling: Not as Effective as You Might Think

Over/undersampling methods come with a handful of drawbacks and risks that may be overlooked in the face of a seemingly quick and easy solution. Let’s delve deeper into the issues associated with these techniques.

Overfitting and Under-representation: The oversampling technique, while it may seem logical at first glance, carries a risk of overfitting, especially in the case of simple methods such as random oversampling, which duplicates instances of the minority class. Overfitting occurs when the model, instead of learning the general patterns in the data, starts to memorize these instances. As a result, although the model might perform well on training data, it is likely to perform poorly on unseen data.

On the other hand, undersampling can result in the loss of significant information from the majority class, leading to the under-representation of crucial patterns in the data. Hence, both oversampling and undersampling can lead to models that are incapable of generalizing well to new, unseen data.

Distortion of Original Data Distribution: Perhaps the most critical issue with over/undersampling is the distortion of the original data distribution. The essence of machine learning is to learn from the inherent patterns in the data. When we manipulate the data distribution using these techniques, we change these inherent patterns. The model might perform well on the manipulated data, but once it’s exposed to real-world data with its original distribution, its performance could drastically degrade. This could lead to models that fail to deliver the desired results when deployed in real-world applications.

Performance Metric Discrepancies: Moreover, balancing classes can lead to performance metric discrepancies. For instance, using over/undersampling to balance class distribution might artificially increase model accuracy. However, this doesn’t mean that the model will perform equally well on real-world, imbalanced data. We could end up with a model that seems impressive during testing but fails to deliver satisfactory results when it matters most.

The Alternatives: Better Ways to Handle Imbalance

Having highlighted the pitfalls of over/undersampling techniques, let’s explore alternatives that could potentially yield better results while avoiding the drawbacks of the traditional methods.

Cost-Sensitive Learning: Rather than altering the data itself, we can make our machine learning algorithm aware of the imbalanced nature of our data. This is done through cost-sensitive learning, where we assign a higher cost to misclassifying instances of the minority class. This way, the algorithm will learn to pay more attention to the minority class, knowing that mistakes with this class will have a higher penalty. Cost-sensitive learning, therefore, offers a way to adjust the learning process itself rather than the data, leading to more reliable and robust models.
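As a minimal sketch of this idea, many scikit-learn estimators accept a class_weight parameter that penalizes minority-class mistakes more heavily; the data and model choice below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic binary problem with roughly a 95/5 class split.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" scales each class's loss term inversely to its
# frequency, so misclassifying a minority instance costs the optimizer more.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"minority recall, unweighted: {r_plain:.2f}")
print(f"minority recall, weighted:   {r_weighted:.2f}")
```

The data itself is never resampled; only the loss the model optimizes changes, so the test-time distribution stays honest.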

Use Better Evaluation Metrics: Accuracy is not always the best metric when dealing with imbalanced data. Instead of focusing on overall accuracy, we should use metrics that account for both the majority and minority classes: precision, recall, the F1 score, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), or the precision-recall curve. The precision-recall curve in particular is a potent tool for imbalanced datasets, as it focuses directly on the minority class and can provide more meaningful insight into the model's performance.
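A toy example makes the accuracy trap concrete: a "lazy" model that always predicts the majority class looks excellent by accuracy while the minority-focused metrics expose the failure (the numbers here are synthetic):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 1000 samples, only 50 positives, and a model that always
# predicts the majority class.
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)            # 0.95 -- looks great
rec = recall_score(y_true, y_pred)              # 0.0  -- catches no positives
f1 = f1_score(y_true, y_pred, zero_division=0)  # 0.0
print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f}")
```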

Ensemble Methods: Ensemble methods, such as bagging and boosting, can be effective in dealing with imbalanced data. For instance, balanced variants of Random Forest, a bagging technique, train each tree in the ensemble on a rebalanced bootstrap sample of the data. These methods build multiple models and combine their predictions, which often results in more robust and reliable models than a single model would provide.
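One way to realize the per-tree rebalancing idea in scikit-learn is RandomForestClassifier's class_weight="balanced_subsample" option, which recomputes class weights on each tree's bootstrap sample; the dataset below is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic problem with roughly a 90/10 class split.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)

# "balanced_subsample" reweights classes within every tree's bootstrap
# sample, so each tree sees an effectively rebalanced view of the data.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight="balanced_subsample",
                             random_state=0).fit(X, y)
print("trees trained:", len(clf.estimators_))
```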

Fine-tuning Hyperparameters of Powerful Algorithms: Some machine learning algorithms, like XGBoost, provide specific parameters for handling imbalanced data. The ‘scale_pos_weight’ parameter in XGBoost, for example, can be adjusted to provide more weight to the minority class during the model training process. By fine-tuning such hyperparameters, we can make the model more sensitive to the minority class and potentially achieve better performance.
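A common heuristic is to set scale_pos_weight to the ratio of negative to positive instances in the training set. A minimal sketch (the label array is made up, and the XGBoost call is left as a comment in case the package is not installed):

```python
import numpy as np

# Hypothetical training labels with a 9:1 imbalance.
y_train = np.array([0] * 900 + [1] * 100)

# Ratio of negatives to positives -- the usual starting value
# for XGBoost's scale_pos_weight.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
print("scale_pos_weight =", ratio)  # 9.0

# With the xgboost package installed, this would plug in as:
# import xgboost as xgb
# model = xgb.XGBClassifier(scale_pos_weight=ratio).fit(X_train, y_train)
```

Treat the ratio as a starting point to tune against a minority-sensitive metric, not a fixed rule.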

Data Collection: Arguably the ideal solution to class imbalance is collecting more data. Of course, it's not always feasible or cost-effective to collect more instances of the minority class. Where it is possible, however, acquiring more data gives the algorithm more information to learn from and reduces the imbalance naturally.

Conclusion: A More Thoughtful Approach to Imbalanced Data

Handling imbalanced data is no easy task, and there’s no one-size-fits-all solution. While over/undersampling techniques have their place in the toolkit of a data scientist, it’s crucial to understand their potential pitfalls and limitations. It’s not just about balancing classes; it’s about maintaining the integrity of our data and building models that can truly learn from it and generalize to unseen data.

In this era of ever-evolving machine learning techniques, I look forward to exploring and advocating for novel and more effective ways to handle imbalanced datasets without resorting to over/undersampling. Yes, it’s a challenge, but the beauty of our field lies in overcoming these challenges to extract valuable insights from our data. After all, the goal is not just to build models with high accuracy on a balanced dataset, but to build models that can provide genuine value and insights from any data they encounter.

Ultimately, it’s not about us against the data. It’s about us working with the data to uncover its stories, its patterns, and its insights. Overcoming the challenge of imbalanced datasets is part of that journey, and it’s a journey that I’m excited to continue, with or without over/undersampling.
