Goal

The goal of this document is to discuss the limitations of transforming a dataset in Driverless AI.  This document is for internal guidance.  The Transform functionality can very easily lead to overfitting and leakage if not performed correctly, so these steps should not be done by the customer without support from the H2O team.


Overview

In this section, we will discuss the ways to Transform a dataset in Driverless AI.


Transform a Dataset using UI/Python Client


Driverless AI offers the ability to transform a dataset in the UI (see: https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/transform-another-dataset.html).  This is also possible in the Python client (though not officially supported).


The Transform action performs two kinds of operations:
  1. Fitting and Transforming - performed on the training dataset
  2. Only Transforming - performed on the validation dataset and test dataset


Fitting and Transforming


What does it do?

Fitting and transforming combines two steps.  First, a feature is fitted to the training dataset. Then the training dataset is transformed with that fitted feature.


As an example, let's say one of Driverless AI's engineered features is a Cluster ID.  To create a feature like Cluster ID, a clustering model must first be trained on the data to identify clusters.  Then a cluster label can be applied.


In this example, the "Fitting" is training the clustering model to identify clusters.  The "Transform" is when the Cluster ID is applied to the rows of data.
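
As a rough analogy outside of Driverless AI, here is a minimal sketch using scikit-learn's KMeans (with made-up columns) that shows the same split: the "Fit" learns the clusters from the training data, while the "Transform" just assigns a Cluster ID.

import pandas as pd
from sklearn.cluster import KMeans

# Made-up training and validation frames (illustration only)
train = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40, 55, 80, 90]})
valid = pd.DataFrame({"age": [29, 48], "income": [45, 85]})

# "Fitting": train the clustering model on the training data to identify clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(train)

# "Transforming": apply the fitted model to label rows with a Cluster ID
train["cluster_id"] = kmeans.predict(train)  # fit and transform on train
valid["cluster_id"] = kmeans.predict(valid)  # transform only on valid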


What dataset is used?


The "Fitting and Transforming" step is applied to the data provided as the "Train Dataset" in the UI.  Note: This dataset can be any dataset with the same column names as the Experiment's Train Dataset.  It does not actually need to be the same data as the Experiment's Train Dataset.



Only Transforming


What does it do?


Transforming simply looks up the features that were derived in the "Fitting and Transforming" step and applies them.  Back to our example with Cluster ID, when "Transforming" is done, the clusters are simply looked up and the Cluster ID is applied to the rows of data.  No clustering model needs to be trained for this step.


What dataset is used?


The "Transforming" step is applied to the Validation Dataset and Test Dataset.  


Note: If you do not provide a Validation Dataset in the UI, the Training Dataset will automatically be split into two pieces (Training and Validation).



Example in Python Client


Not Officially Supported

import pandas as pd

# Assumes: `dai` is a connected Python client, `ex` is a completed
# experiment, and `new_data` is a dataset already imported into Driverless AI.

# Transform data using an existing model/experiment (internal API)
transform = dai._backend.fit_transform_batch_sync(
    model_key=ex.key,
    training_dataset_key=ex.datasets.get('train_dataset').key,
    validation_dataset_key=new_data.key,
    test_dataset_key=None,
    validation_split_fraction=0.25,
    seed=1234,
    fold_column=None,
)

# Download the new transformed data
transform_newdata_path = dai._backend.download(
    src_path=transform.validation_output_csv_path,
    dest_path="./newdata_transformed.csv",
)

# Read the transformed data into pandas
transformed_newdata = pd.read_csv(transform_newdata_path)


Transform a Dataset using MOJO

Not Officially Supported


The transform option in MOJO is specifically for the Big Data Recipe.  It is experimental and should not be used without support from the DAI engineering team.


It is possible to generate a MOJO that returns the engineered features instead of the predictions.  This is the basis for the Big Data Recipe.


To create this MOJO, you must set make_mojo_scoring_pipeline_for_features_only=True.  This will create a MOJO that only returns engineered features, not predictions.


If the experiment has already been created, you can retrain the final pipeline and change the make_mojo_scoring_pipeline_for_features_only setting to True.  Note: This will not generate the exact same experiment.

# Retrain final pipeline with new parameters
new_ex = ex.retrain(final_pipeline_only=True, make_mojo_scoring_pipeline_for_features_only=True)
mojo_path = new_ex.artifacts.download('mojo_pipeline', overwrite=True)
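
For completeness, below is a hedged sketch of scoring with such a MOJO from Python.  It assumes the Python MOJO runtime (daimojo) is installed and licensed, and that the downloaded archive has been unpacked; the paths are hypothetical.  With the features-only setting, predict() returns engineered feature columns rather than predictions.

import datatable as dt
import daimojo.model

# Load the features-only MOJO (hypothetical path to the unpacked pipeline)
m = daimojo.model("./mojo-pipeline/pipeline.mojo")

# Score new data: the resulting frame holds engineered features, not predictions
frame = dt.fread("./newdata.csv")
engineered = m.predict(frame)
print(engineered.names)  # names of the engineered feature columns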


Limitations


General Limitations

Regardless of how Transform is applied, the following limitations exist:


  1. Only the features from a single pipeline are created - the pipeline with the most features
    • This means that if the model is an ensemble and different features are used for different models within that pipeline, the Transform will not produce all of the engineered features.

  2. The user needs to understand how data leakage can occur.  
    • The user should never use cross validation on the training dataset to validate a model - there will be data leakage.
    • The user should not build an ensemble of models - there could be data leakage.


Leakage


Leakage can occur if transformed datasets are not used correctly because some of the engineered features use the target value.  Engineered features that use the target value employ internal cross validation to prevent data leakage.  In cross validated target encoding (CVTE), for example, the mean of the target is always calculated on out of fold data.  See How Driverless AI Prevents Overfitting and Leakage for more details.


Below is an example of how these engineered features will differ between the Transformed Train Dataset and Transformed Validation or Test Dataset.


Transformed Train Dataset


Even though target encoding generally means the average of the target for the group, you will notice that there are multiple unique values of CVTE within Business Travel = Level 1.  This is because the average is taken on out-of-fold data.  In this example, Driverless AI used 5 folds to create the Cross Validation Target Encoding.


Business Travel    CVTE Business Travel
Level 1            0.153
Level 1            0.147
Level 1            0.151
Level 1            0.149


Transformed Validation or Test Dataset


In the Transformed Validation or Test Dataset, there is simply one value of Cross Validation Target Encoding for each group.  This is because Cross Validation is no longer necessary to prevent data leakage.  The Validation and Test dataset will not be the data used to train a new model.  For the Validation and Test dataset, the average of the Target is calculated across the whole training dataset and simply applied to the Validation and Test dataset.  (The Fit Transform happens on the Training dataset and the Transform happens on the Validation and Test dataset).


Business Travel    CVTE Business Travel
Level 1            0.15
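
To make the difference concrete, here is a minimal pandas sketch of out-of-fold target encoding (an illustration with made-up data and column names, not Driverless AI's implementation).  The training frame receives fold-dependent values, while the validation frame receives a single value per group.

import pandas as pd
from sklearn.model_selection import KFold

# Made-up training data: one categorical column and a binary target
train = pd.DataFrame({
    "business_travel": ["Level 1"] * 8 + ["Level 2"] * 4,
    "target":          [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1],
})

# Training frame: each row is encoded with the target mean from the OTHER
# folds, so one category shows several slightly different values
train["cvte"] = 0.0
for fit_idx, apply_idx in KFold(n_splits=4, shuffle=True, random_state=1).split(train):
    fold_means = train.iloc[fit_idx].groupby("business_travel")["target"].mean()
    train.loc[train.index[apply_idx], "cvte"] = (
        train.iloc[apply_idx]["business_travel"].map(fold_means).to_numpy()
    )

# Validation/test frame: one value per group, the mean over ALL training rows
valid = pd.DataFrame({"business_travel": ["Level 1", "Level 2"]})
valid["cvte"] = valid["business_travel"].map(
    train.groupby("business_travel")["target"].mean()
)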


Using the Datasets


When you use the Transform button in the UI, you will see that the Training Dataset should be used for training, the Validation Dataset for parameter tuning, and the Test Dataset for final scoring.



The reason for these explicit suggestions is to protect the user from data leakage.  If the data scientist uses cross validation to tune the model instead of the Validation data provided, they can accidentally leak the answer via the Cross Validated Target Encoding (or other target-based features).


Driverless AI does not run into this problem because internally it uses CCV (Cross-Cross-Validation): cross validation within a fold to create features like target encoding.  After a train and validation pair has been determined, only the train part undergoes another k-fold procedure, in which k-1 parts are used to estimate average target values for the selected categories and apply them to the k-th part, until the train dataset has its mean target values estimated in k batches.  Those same fold means are applied to the outer validation dataset by taking the average of all k folds' mean values.
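
The sketch below mirrors that description structurally (simplified and hypothetical, not DAI's code): an outer train/validation split, an inner k-fold target-encoding loop over the train part only, and outer validation rows receiving the average of all k folds' means.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

def ccv_target_encode(df, cat_col, target_col, inner_folds=5, seed=1):
    # Outer split: determine the train/validation pair first
    train, valid = train_test_split(df, test_size=0.25, random_state=seed)
    train, valid = train.copy(), valid.copy()
    train["te"] = np.nan
    fold_means_on_valid = []
    inner = KFold(n_splits=inner_folds, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in inner.split(train):
        # k-1 parts estimate per-category target means...
        means = train.iloc[fit_idx].groupby(cat_col)[target_col].mean()
        # ...which are applied to the held-out k-th part of the train data
        train.iloc[apply_idx, train.columns.get_loc("te")] = (
            train.iloc[apply_idx][cat_col].map(means).to_numpy()
        )
        fold_means_on_valid.append(valid[cat_col].map(means))
    # Outer validation rows get the average of all k folds' mean values
    valid["te"] = pd.concat(fold_means_on_valid, axis=1).mean(axis=1)
    return train, valid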


The data scientist will not be able to perform CCV on the transformed datasets since the work of transforming is separated from the model building.  Therefore, they need to explicitly use a Validation dataset for model tuning.


Limitations of Transform in UI/Python Client


When using the Transform capability in the UI or the Python Client, the "Fitting and Transforming" step is run.  This means the features will be created from scratch from whatever data you provide as your training dataset.  The engineered features are not the same as those used in the model; the same pipeline is simply refit.


With default settings, Driverless AI will perform cross validation and the fit/transform will be performed on different folds of data.  In this case, the features will be slightly different.


If the engineered features are only row-based transformations (e.g., adding columns together, log transforms), the features will match.  This, however, will almost never be the case unless the user has restricted the types of feature transformations that can be applied.
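
For contrast, here is a tiny illustration (made-up columns) of purely row-based transformations, which carry no fitted state and therefore come out identical no matter which dataset the pipeline was refit on:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Row-based transforms depend only on the values in each row, not on any
# statistics learned from a training dataset
df["a_plus_b"] = df["a"] + df["b"]
df["log_a"] = np.log(df["a"])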


Limitations of Transform in MOJO


If the user wants to create the engineered features with the MOJO, they will not be able to get the predictions from the MOJO.  Additionally, this setting for the MOJO was specifically made for the Big Data Recipe.  What the setting does can change the experiment and may be modified between releases based on the needs of the Big Data Recipe.