The goal of this document is to show what imbalanced-data support Driverless AI provides and how to enable it.
Driverless AI offers imbalanced algorithms for use cases with a binary, imbalanced target. These are enabled by default if the target column is considered imbalanced. Even when they are enabled, Driverless AI may decide not to use them in the final model if they perform poorly.
While Driverless AI does try imbalanced algorithms by default, they have not generally been found to improve model performance. Using them also results in a much larger final model, because multiple models trained with different balancing ratios are combined.
I have found better performance by making sure the appropriate scorer is selected and, in certain cases, by supplying a Weight column. See the Additional Suggestions section at the bottom of the article.
Instructions for Turning on Imbalanced Algorithms
Here are instructions for turning on the Imbalanced Algorithms only:
- Go to the Expert Settings
- Click on Recipes
- Click on Include Specific Models
- Select only the imbalanced algorithms: ImbalancedXGBoost, ImbalancedLightGBM
I've added a GIF that shows this:
We have two types of imbalanced algorithms: ImbalancedXGBoost and ImbalancedLightGBM. These algorithms train an XGBoost or LightGBM model multiple times on different samples of the data and then combine the predictions of those models. Each model sees a different sample and may use a different sampling ratio; by trying multiple ratios, we are more likely to end up with a robust model.
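To make the idea concrete, here is a minimal sketch of this kind of bagging over balancing ratios. This is not Driverless AI's actual implementation: it uses scikit-learn's LogisticRegression as a stand-in for XGBoost/LightGBM, and the `balanced_sample` helper and the ratios tried are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic binary data: roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

def balanced_sample(X, y, minority_frac, rng):
    """Keep all minority rows; undersample the majority class so the
    minority class makes up roughly `minority_frac` of the sample."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n_neg = int(len(pos) * (1 - minority_frac) / minority_frac)
    keep = np.concatenate([pos, rng.choice(neg, size=min(n_neg, len(neg)), replace=False)])
    return X[keep], y[keep]

preds = []
for frac in (0.1, 0.25, 0.5):  # try several balancing ratios
    Xs, ys = balanced_sample(X, y, frac, rng)
    model = LogisticRegression(max_iter=1000).fit(Xs, ys)
    preds.append(model.predict_proba(X)[:, 1])

# Combine the per-ratio models by averaging their predicted probabilities.
ensemble_pred = np.mean(preds, axis=0)
```

Each model in the loop is deliberately trained on a differently balanced view of the data; averaging the predictions hedges against picking a single bad ratio.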
When the experiment is done, the AutoDoc includes details about the bagging that was performed. Here is an example:
Driverless AI will automatically determine the sampling done for these models. If you would prefer to specify the amount of sampling, you can set the "Target fraction of minority class". If you wanted to have 10% minority class for example, you could change this setting to 0.1. Each model would then be trained with a sample where 10% was the minority class and 90% was the majority class.
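The arithmetic behind that setting can be sketched as follows. The row counts here are made-up illustrative numbers, not anything Driverless AI reports:

```python
# Suppose the data has 500 minority rows and 99,500 majority rows
# (hypothetical numbers for illustration).
minority_rows = 500
target_fraction = 0.1  # the "Target fraction of minority class" setting

# Keep all minority rows; undersample the majority so the minority
# rows make up 10% of each training sample.
majority_rows_kept = int(minority_rows * (1 - target_fraction) / target_fraction)
sample_size = minority_rows + majority_rows_kept

print(majority_rows_kept)                 # 4500
print(minority_rows / sample_size)        # 0.1
```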
Make sure you select a scorer that is not biased by imbalanced data. I recommend:
- MCC: uses the proportion of true negatives rather than the absolute count (in imbalanced use cases you will have a very high count of true negatives)
- AUCPR: true negatives are not used at all in this calculation
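To see why a biased scorer is misleading on imbalanced data, compare accuracy against MCC and AUCPR for a model that always predicts the majority class. This small check uses scikit-learn's metric functions and made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, average_precision_score

# Imbalanced ground truth: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts the majority (negative) class.
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks great, but misleading
print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- correctly flags no skill
```

Accuracy rewards the model for the flood of easy true negatives; MCC does not.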
A weight column can be used to internally upsample the rare events. If you create a Weight column with a value of 10 when the target is positive and 1 otherwise, it essentially tells the algorithm that getting the positive class correct is 10x more important.
The attached notebook shows an example of trying different weight columns to see if it improves capturing the rare class.