Goal

The goal of this document is to provide more information about why columns are dropped in Driverless AI.  This document is meant for an internal audience.  It should be used to help with any support tickets or customer questions regarding why a feature was dropped.


Overview


In Driverless AI's AutoDoc, there is a section defining the reasons why any column was removed.  An example of this section is shown below:


AutoDoc lists the following reasons for removing a column:


Reason                       Definition
User                         The user explicitly dropped the column
DAI ID Column                The column was identified as an ID column
DAI Constant Column          The column has the same value for all rows
DAI Data Shift               The column is significantly shifted in the validation/test dataset, so it was dropped
DAI Leakage                  The column is believed to contain leakage (the column knows the answer), so it was dropped
DAI Input Feature Reduction  Driverless AI dropped the column because it had low variable importance
DAI Model Dropped            The Driverless AI model decided to drop the feature. This label is applied if the feature is dropped for any reason other than those listed above. More details on this label are given in the DAI Model Dropped section.


DAI Leakage

If the reason a column is dropped is listed as "DAI Leakage", it means that the column probably contains the answer. For example, when predicting Age with Year_of_Birth as a feature, Year_of_Birth accidentally leaks the answer.


If Driverless AI drops the feature for this reason, you can find more details in the Notifications tab and the logs. You will see a message like this:


Possible data leakage detected
Possible leakage detected in training data for feature zcta5_cd ( AUC: 0.9283551 )
Possible leakage detected in training data for feature st_cd ( AUC: 0.9248665 )
Possible leakage detected in training data for feature flsa_cmp_assd_amt ( AUC: 0.9236613 )
Possible leakage detected in training data for feature cmp_assd_cnt ( AUC: 0.9216365 )


For more information about how leakage is detected, refer to our documentation: https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/expert_settings/features_settings.html#check-leakage
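
When a log is long, it can help to pull out every flagged feature in one pass. The following is a minimal Python sketch that assumes the log lines have exactly the wording shown in the example above; the helper name find_leakage_features is hypothetical, and the message format (including the metric reported) may differ between releases.

```python
import re

# Matches lines like the example above. The wording is taken from that
# example only; other releases or metrics may print something different.
LEAKAGE_RE = re.compile(
    r"Possible leakage detected in training data for feature (.+?) "
    r"\( AUC: ([0-9.]+) \)"
)


def find_leakage_features(log_text):
    """Return a list of (feature_name, auc) pairs found in log_text."""
    return [(name, float(auc)) for name, auc in LEAKAGE_RE.findall(log_text)]


sample = """Possible data leakage detected
Possible leakage detected in training data for feature zcta5_cd ( AUC: 0.9283551 )
Possible leakage detected in training data for feature st_cd ( AUC: 0.9248665 )
"""

for feature, auc in find_leakage_features(sample):
    print(feature, auc)
```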


DAI Input Feature Reduction


If the reason a column is dropped is listed as "DAI Input Feature Reduction", it means that the number of columns in the dataset is greater than the allowed number of columns in the model.  


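The number of allowed columns in the model can be controlled in the Expert Settings by the following parameters, whose names also appear in the example log message in this section:

  • max_orig_cols_selected
  • max_orig_nonnumeric_cols_selected
  • max_orig_numeric_cols_selected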


If this is the reason the column was dropped, you will find more information in the logs.  Here is an example of the message you will see:


Dropping the following weak features with small importance due to columns=32 > max_orig_cols_selected=10 or > ncol_nonnum_effective non-numeric columns=7 > max_orig_nonnumeric_cols_selected=3: or > ncol_num_effective numeric columns=22 > max_orig_numeric_cols_selected=10000000: <<<['Gender', 'EducationField', 'JobRole', 'OverTime']>>>


In the above example, Driverless AI dropped the features Gender, EducationField, JobRole, and OverTime because they had low variable importance and the settings limited the model to:

  • a maximum of 10 columns
  • a maximum of 3 non-numeric columns
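
The limits and the dropped features can also be read out of the message programmatically. The sketch below is one possible approach in Python, assuming the message keeps the layout of the example above (limits written as max_orig_...=N and the dropped features as a list between <<< and >>>). The function name parse_feature_reduction is hypothetical.

```python
import ast
import re


def parse_feature_reduction(line):
    """Extract the column limits and the dropped features from one log line."""
    # Limits appear as name=value pairs, e.g. max_orig_cols_selected=10.
    limits = {k: int(v) for k, v in re.findall(r"(max_orig_\w+)=(\d+)", line)}
    # Dropped features are printed as a Python-style list between <<< and >>>.
    match = re.search(r"<<<(\[.*?\])>>>", line)
    dropped = ast.literal_eval(match.group(1)) if match else []
    return limits, dropped


sample_line = (
    "Dropping the following weak features with small importance due to "
    "columns=32 > max_orig_cols_selected=10 or > ncol_nonnum_effective "
    "non-numeric columns=7 > max_orig_nonnumeric_cols_selected=3: or > "
    "ncol_num_effective numeric columns=22 > "
    "max_orig_numeric_cols_selected=10000000: "
    "<<<['Gender', 'EducationField', 'JobRole', 'OverTime']>>>"
)

limits, dropped = parse_feature_reduction(sample_line)
print(limits)
print(dropped)
```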


DAI Model Dropped


AutoDoc may list a feature as "DAI Model Dropped" for several reasons.  This section describes those reasons and how you can find more information about each.


Reason #1: There was no transformer available for that feature


This can happen if the transformer that handles the type of data is turned off.  For example, if the ImageTransformer is turned off, then any image column will have to be dropped from the model. You can determine if this is the case with your model by downloading the logs and searching for the column that was dropped.


If it was dropped because no transformer was available, then you will find a message like this in the logs:

- Feature engineering search space: Original
- Feature engineering selected/enabled but inapplicable: Raw
LightGBMModel is not using 7 features: ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
LightGBMModel *default* feature->transformer map
Age: ['OriginalTransformer']
DailyRate: ['OriginalTransformer']
DistanceFromHome: ['OriginalTransformer']
BusinessTravel: [None]
Department: [None]
XGBoostGBMModel is not using 2 features: ['BusinessTravel', 'Department']
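
To list every feature that has no transformer, you can scan the feature->transformer map for entries whose value is [None]. The following Python sketch assumes the map lines look like the excerpt above ("name: [list]"); the helper name features_without_transformer is hypothetical, and the log layout may vary by release.

```python
import ast


def features_without_transformer(log_text):
    """Return the features whose transformer list is [None], in log order."""
    missing = []
    for line in log_text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # not a "name: value" line
        try:
            transformers = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            continue  # value is not a Python-style literal, e.g. "Original"
        if transformers == [None] and name not in missing:
            missing.append(name)
    return missing


sample_log = """- Feature engineering search space: Original
- Feature engineering selected/enabled but inapplicable: Raw
LightGBMModel is not using 7 features: ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
LightGBMModel *default* feature->transformer map
Age: ['OriginalTransformer']
DailyRate: ['OriginalTransformer']
BusinessTravel: [None]
Department: [None]
"""

print(features_without_transformer(sample_log))
```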


Reason #2: Driverless AI is running out of memory


If Driverless AI begins to run out of memory (this can often occur for very large datasets), then it will start to drop columns.


In this case, the logs will contain out-of-memory (OOM) warnings along with messages stating that columns are being dropped.


For more information about how this response is triggered, see the documentation: https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/expert_settings/system_settings.html#allow-reduce-features-when-failure



Reason #3: The Interpretability Setting Reduced the Complexity of the Model


Driverless AI may choose to drop features because the Interpretability Knob restricts the complexity of the model.  If this occurs, you will see a message like the following in your Notifications and in your logs:

Applied limits on transformations and features to some individuals, who will be re-scored. This includes restricting transformed feature count to no more than original feature count, up to cap on feature count of nfeatures_max_threshold=200
This can lead to worse scores.
If unexpected, set limit_features_by_interpretability=False in expert settings, specify own limit via nfeatures_max in expert settings, change nfeatures_max_threshold in config.toml, or change map from interpretability to transformed feature count in config.toml, currently: features_allowed_by_interpretability={1: 10000000, 2: 10000, 3: 1000, 4: 500, 5: 300, 6: 200, 7: 150, 8: 100, 9: 80, 10: 50, 11: 50, 12: 50, 13: 50}.
Individual 0 : Reduced transformations from 61 -> 33 and features from 116 -> 61
Individual 1 : Reduced transformations from 61 -> 53 and features from 69 -> 61
Individual 2 : Reduced transformations from 61 -> 16 and features from 116 -> 61
Individual 3 : Reduced transformations from 61 -> 21 and features from 116 -> 61

Note: This notification tells you that some features were removed because of the Interpretability knob, but it does not tell you which features were removed.
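
To see how much each individual was cut back, the "Individual N : Reduced ..." lines can be summarized with a short script. This is a minimal Python sketch that assumes the lines match the example message above; the function name summarize_reductions is hypothetical. It reports counts only, since the message does not name the removed features.

```python
import re

# Pattern follows the example lines above; the wording may change by release.
REDUCED_RE = re.compile(
    r"Individual (\d+) : Reduced transformations from (\d+) -> (\d+) "
    r"and features from (\d+) -> (\d+)"
)


def summarize_reductions(log_text):
    """Return one dict per individual with the number of items removed."""
    rows = []
    for m in REDUCED_RE.finditer(log_text):
        individual, t_from, t_to, f_from, f_to = map(int, m.groups())
        rows.append({
            "individual": individual,
            "transformations_removed": t_from - t_to,
            "features_removed": f_from - f_to,
        })
    return rows


sample_log = """Individual 0 : Reduced transformations from 61 -> 33 and features from 116 -> 61
Individual 1 : Reduced transformations from 61 -> 53 and features from 69 -> 61
"""

for row in summarize_reductions(sample_log):
    print(row)
```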


Reason #4: Driverless AI Dropped it During the Genetic Algorithm


The genetic algorithm can decide to drop features for multiple reasons.  The algorithm is Driverless AI's proprietary IP, and the reasons why it changes a column can differ from release to release.


The main reasons why Driverless AI would choose to drop a column are:

  1. The feature had low variable importance and fell below the importance threshold

  2. Driverless AI tried removing the feature by trial and error, and the model performed better without it

The logs do not state which of these reasons applied.