Contents of this article:

  1. Introduction
  2. TPOT
  3. Results
  4. Calibration Code
  5. Production (Unseen Data) Results Code

1. Introduction

Using Morgan (circular) fingerprints as input data, we create an ensemble of Scikit-learn-based pipelines via TPOT (Tree-based Pipeline Optimization Tool) to predict lipophilicity values. See Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration for the link to the raw data as well as code for converting SMILES strings to fingerprints via RDKit.

2. TPOT

TPOT is an automated machine learning library based on Scikit-learn and XGBoost that uses genetic programming (a genetic algorithm with a tree-like data structure) to create machine learning pipelines.
 
Below we show the TPOT API as well as the parameters that we used. The entire code is displayed at the end of this article.

ml_model = TPOTRegressor(generations=gens, population_size=pop, scoring=error_metric,
                         max_time_mins=mins, cv=cv_folds, verbosity=verbose,
                         n_jobs=num_jobs, early_stop=early_stop_generations,
                         max_eval_time_mins=mins_per_pipeline,
                         config_dict=tpot_config_dict,
                         periodic_checkpoint_folder=checkpoint_folder)

# tpot parameters
cvfolds = 3
errormetric = 'neg_mean_squared_error'
numjobs = -1
generations = 1000
population_size = 100
max_time_mins = 8*60
max_mins_per_pipeline = 10
early_stop_gens = 10
verbosity = 1

n_jobs=-1 specifies the use of all available CPUs (16 for the Amazon Web Services g3.4xlarge machine that we used). We set generations to a large number because run time is instead controlled by a maximum execution time, max_time_mins, and by early stopping, early_stop (the number of generations without improvement in the scoring metric, here negative mean squared error). max_eval_time_mins prevents TPOT from getting stuck on a particular pipeline (this can sometimes happen with Scikit-learn's implementation of the support vector machine algorithm).

Specifying periodic_checkpoint_folder tells TPOT to save a pipeline from the current Pareto front as a Python file in that directory once per generation, but no more often than once per 30 seconds. We can then use these pipelines later to form an ensemble with the final TPOT result.

Setting config_dict allows us to specify which preprocessors, feature selectors, and machine learning algorithms TPOT may use, along with their allowable parameters. The default dictionary can be found here. See the code at the end of this article for our custom dictionary.

The final TPOT output is both a Python file and, if specified, a Scikit-learn pipeline. Intermediate results in the checkpoint folder are Python files. To use these results on the production data, we must copy them into a Python file (see Section 5).

The final (best) TPOT pipeline was:

    exported_pipeline = make_pipeline(
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        ),
        StackingEstimator(estimator=LassoLarsCV(normalize=False)),
        SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01, kernel="rbf",
            shrinking=True, tol=0.001))

As each input data point is a binary vector of length 2048, it is not surprising that the support vector machine algorithm performed well.
 
The FunctionTransformer(copy) object allows for a basic form of stacking when an estimator sits in the middle of a pipeline: FunctionTransformer(copy) makes a copy of the entire dataset, which is then merged with the predictions of that estimator on the dataset.

FeatureUnion (constructed here via the make_union helper) applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful for combining several feature extraction mechanisms into a single transformer.
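
A minimal sketch of these two pieces working together (the 2x2 array is our own toy example, not data from the original pipeline):

import numpy as np
from copy import copy
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

x = np.array([[1, 2], [3, 4]])
# make_union builds a FeatureUnion; each branch returns a copy of the input unchanged
union = make_union(FunctionTransformer(copy), FunctionTransformer(copy))
print(union.fit_transform(x))
# [[1 2 1 2]
#  [3 4 3 4]]  <- FeatureUnion concatenates the two copies column-wise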

StackingEstimator is a built-in TPOT utility that facilitates stacking; a sketch of its behavior appears after the data-flow list below.

Generation 7 yielded the following interesting pipeline.

    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                                        kernel="rbf", shrinking=False, tol=0.1)),
        StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.25,
                          min_samples_leaf=11, min_samples_split=2, n_estimators=100)),
        SelectFwe(score_func=f_regression, alpha=0.023),
        ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002, min_samples_leaf=6,
                            min_samples_split=4, n_estimators=150))

Here, we have the following data flow:
1. original input data to SVR yields 1st prediction
2. original input data and 1st prediction (numpy hstack) to RandomForestRegressor yields 2nd prediction
3. original input data and 1st prediction and 2nd prediction (numpy hstack) to SelectFwe to ExtraTreesRegressor to final output
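
Below is a minimal sketch of step 1, using random stand-in data (the fingerprint matrix and targets here are hypothetical):

import numpy as np
from sklearn.svm import SVR
from tpot.builtins import StackingEstimator

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(100, 2048)).astype(float)  # stand-in binary fingerprints
y = rng.random(100)

stacker = StackingEstimator(estimator=SVR(kernel="rbf", gamma=0.01, C=100.0))
x_augmented = stacker.fit(x, y).transform(x)
print(x_augmented.shape)  # (100, 2049): the SVR prediction is hstacked with the 2048 input columns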

SelectFwe is a univariate feature selector: it keeps only those features whose f_regression p-values pass a family-wise error rate threshold controlled by alpha.

3. Results

We present results for the best (final) TPOT pipeline as well as for an ensemble. The ensemble prediction is the median of the predictions of the best pipeline from each generation together with the overall best pipeline. We used a held-out part of the data set (labeled production in the code at the end of this article) to obtain these results.
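
The median ensembling itself is one line of numpy; here is a minimal sketch with hypothetical prediction columns (it mirrors the np.median call in the Section 5 code):

import numpy as np

# each column holds one pipeline's predictions for the same samples (values are hypothetical)
predictions = np.array([[2.1, 2.3, 1.9],
                        [0.4, 0.6, 0.5]])
print(np.median(predictions, axis=1))  # [2.1 0.5]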

TPOT Lipophilicity Errors

| Pipeline | Mean Squared Error | Median Absolute Error | Correlation Coefficient | R2 |
|---|---|---|---|---|
| Generation 1 | 0.6376 | 0.4204 | 0.7674 | 0.5769 |
| Generation 2 | 0.6260 | 0.4267 | 0.7694 | 0.5846 |
| Generation 3 | 0.6351 | 0.4202 | 0.7687 | 0.5786 |
| Generation 4 | 0.6357 | 0.4221 | 0.7685 | 0.5782 |
| Generation 5 | 0.6298 | 0.4197 | 0.7707 | 0.5821 |
| Generation 6 | 0.6343 | 0.4223 | 0.7682 | 0.5791 |
| Generation 7 | 0.6294 | 0.4206 | 0.7711 | 0.5824 |
| Generation 8 | 0.5882 | 0.4149 | 0.7838 | 0.6097 |
| Generation 9 | 0.5918 | 0.4313 | 0.7803 | 0.6073 |
| Generation 10 | 0.5917 | 0.4350 | 0.7804 | 0.6074 |
| Best | 0.5882 | 0.4149 | 0.7838 | 0.6097 |
| Ensemble Median | 0.6188 | 0.4134 | 0.7738 | 0.5894 |



Below is a comparison of our results to those of the MoleculeNet benchmark from MoleculeNet: A Benchmark for Molecular Machine Learning (page 51). The benchmark reports root mean squared error, so our entries are the square roots of the mean squared errors in the table above (e.g., TPOT Best: √0.5882 ≈ 0.767).
 

Our Results Versus Benchmark Results

| Model | Root Mean Squared Error |
|---|---|
| Random Forest (RF) | 0.876 +/- 0.040 |
| Multitask (Multilayer Perceptron with multiple outputs) | 0.859 +/- 0.013 |
| XGBoost | 0.799 +/- 0.054 |
| Kernel Ridge Regression (KRR) | 0.899 +/- 0.043 |
| Graph Convolutional Network (GC) | 0.655 +/- 0.036 |
| Directed Acyclic Graph (DAG) | 0.835 +/- 0.039 |
| Weave | 0.715 +/- 0.035 |
| Message Passing Neural Network (MPNN) | 0.719 +/- 0.031 |
| TPOT Best | 0.767 |
| TPOT Ensemble | 0.787 |
| Multilayer Perceptron Ensemble | 0.823 |
| 1D CNN Fingerprints | 0.857 |
| 2D CNN Fingerprints | 0.844 |

R2 results in bar-chart form can be seen at the MoleculeNet site (see section Physical Chemistry -> lipo).
 
The benchmark results were obtained by running multiple random permutations of the dataset, which we did not do, so the comparisons with our results are not exact. Still, it is interesting that stacking and ensembling with Morgan fingerprint features yields results competitive with various graph convolutional models, which are generally expected to outperform other types of models.
 

4. Calibration Code

import pandas as pd
import numpy as np
import os
from pathlib import Path
import joblib

from sklearn.metrics import mean_squared_error  # 'neg_mean_squared_error'
from sklearn.metrics import median_absolute_error  # 'neg_median_absolute_error'
from sklearn.metrics import r2_score

from tpot import TPOTRegressor
    
# *******************************************************************

def tpot_dictionary(num_input, step):
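    """Return a custom TPOT configuration dictionary mapping operator names
    to hyperparameter grids. num_input and step set the search range for
    PCA n_components below."""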
    
    tpot_config_dict = {

    'sklearn.linear_model.ElasticNetCV': {
        'l1_ratio': np.arange(0.0, 1.01, 0.05),
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
    },

    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': [50, 100, 150, 200, 250],
        'max_features': np.arange(0.05, 1.01, 0.1),
        'min_samples_split': range(2, 13),
        'min_samples_leaf': range(1, 13),
        'bootstrap': [True, False]
    },

    'sklearn.ensemble.GradientBoostingRegressor': {
        'n_estimators': [50, 100, 150, 200, 250],
        'loss': ["ls", "lad", "huber", "quantile"],
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 13),
        'min_samples_leaf': range(1, 13),
        'subsample': np.arange(0.05, 1.01, 0.1),
        'max_features': np.arange(0.05, 1.01, 0.1),
        'alpha': [0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
    },

    'sklearn.ensemble.AdaBoostRegressor': {
        'n_estimators': [50, 100, 150, 200, 250],
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'loss': ["linear", "square", "exponential"]
    },

    'sklearn.neighbors.KNeighborsRegressor': {
        'n_neighbors': np.arange(1,101,2,dtype='int'),
        'weights': ["uniform", "distance"],
        'p': [2]
    },

    'sklearn.linear_model.LassoLarsCV': {
        'normalize': [True, False]
    },

    'sklearn.svm.LinearSVR': {
        'loss': ["epsilon_insensitive", "squared_epsilon_insensitive"],
        'dual': [True, False],
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'epsilon': [1e-4, 1e-3, 1e-2, 1e-1, 1.]
    },

    'sklearn.svm.SVR': {
        'kernel': ['rbf'],
        'cache_size': [1000],
        'shrinking': [True, False],
        'gamma': [5.0e-5, 1.0e-2, 1.0, 1.0e3],
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        'C': [1.0e-3, 1.0e-1, 1.0, 100.0, 1.0e4],
        'epsilon': [1e-4, 1e-3, 1e-2, 1e-1, 1.]
    },

    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [50, 100, 150, 200, 250],
        'max_features': np.arange(0.05, 1.01, 0.2),
        'min_samples_split': range(2, 13),
        'min_samples_leaf': range(1, 13),
        'bootstrap': [True, False]
    },

    'sklearn.linear_model.RidgeCV': {
    },

    'xgboost.XGBRegressor': {
        'n_estimators': np.linspace(25,250,num=10,endpoint=True,dtype='int'),
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.1),
        'min_child_weight': range(1, 13),
        'colsample_bylevel': [0.7, 0.85, 1.0],
        'nthread': [1],
        'objective': ['reg:squarederror']
    },

    # Preprocesssors

    'sklearn.decomposition.FastICA': {
        'tol': np.arange(0.0, 1.01, 0.05)
    },

#    'sklearn.preprocessing.MaxAbsScaler': {
#    },
#
#    'sklearn.preprocessing.MinMaxScaler': {
#    },
#
#    'sklearn.preprocessing.Normalizer': {
#        'norm': ['l1', 'l2', 'max']
#    },

    'sklearn.decomposition.PCA': {
        'n_components': np.arange(num_input//2, num_input+1, 
                                  step, dtype='int'),
        'svd_solver': ['randomized'],
        'iterated_power': range(1, 11)
    },

#    'sklearn.preprocessing.PolynomialFeatures': {
#        'degree': [2,3],
#        'include_bias': [False],
#        'interaction_only': [False]
#    },
#
#    'sklearn.kernel_approximation.RBFSampler': {
#        'gamma': np.arange(0.0, 1.01, 0.05)
#    },
#
#    'sklearn.preprocessing.RobustScaler': {
#    },
#
#    'sklearn.preprocessing.StandardScaler': {
#    },


    # Selectors
    'sklearn.feature_selection.SelectFwe': {
        'alpha': np.arange(0, 0.05, 0.001),
        'score_func': {
            'sklearn.feature_selection.f_regression': None
        }
    },

    'sklearn.feature_selection.SelectPercentile': {
        'percentile': np.arange(5,101,5,dtype=int),
        'score_func': {
            'sklearn.feature_selection.f_regression': None
        }
    },

    'sklearn.feature_selection.VarianceThreshold': {
        'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
    },

    'sklearn.feature_selection.SelectFromModel': {
        'threshold': np.arange(0, 1.01, 0.1),
        'estimator': {
            'sklearn.ensemble.ExtraTreesRegressor': {
                'n_estimators': [100],
                'max_features': np.arange(0.1, 1.01, 0.1)
            }
        }
    }

    }
    
        
    return tpot_config_dict
    
# end of tpot_dictionary()    

# *******************************************************************

def tpot_regression(x_calib, y_calib, x_prod, y_prod, results_direct,
                    cv_folds, error_metric, num_jobs, gens, pop,
                    mins, mins_per_pipeline, verbose, early_stop_generations,
                    tpot_config_dict, model_name='tpot_best'):
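    """Fit a TPOTRegressor on the calibration data, export the best pipeline,
    then predict and score the production (unseen) data."""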
        
    checkpoint_folder = results_direct + 'checkpoint_folder/'
    if not Path(checkpoint_folder).is_dir():
        os.mkdir(checkpoint_folder)
       
    ml_model = TPOTRegressor(generations=gens, population_size=pop, scoring=error_metric,
                             max_time_mins=mins, cv=cv_folds, verbosity=verbose,
                             n_jobs=num_jobs, early_stop=early_stop_generations,
                             max_eval_time_mins=mins_per_pipeline,
                             config_dict=tpot_config_dict,
                             periodic_checkpoint_folder=checkpoint_folder)
    
    ml_model.fit(x_calib, y_calib)
    
    # save entire pipeline
    ml_model.export(results_direct + model_name + '.py')
    joblib.dump(ml_model.fitted_pipeline_, results_direct + model_name + '.sav')
   
    # for cross-validation errors see the exported model py file
        
    # production - results and errors
    y_prod_predict = ml_model.predict(x_prod)
    np.save(results_direct + model_name + '_prod_predicted.npy', y_prod_predict)
    
    df_prod_errors = pd.DataFrame(index=['Mean Squared Error','Median Absolute Error',
                                         'Correlation Coefficient','R2'])
    df_prod_errors['TPOT Best'] = [mean_squared_error(y_prod, y_prod_predict),
                                   median_absolute_error(y_prod, y_prod_predict),
                                   np.corrcoef(y_prod, y_prod_predict)[0][-1],
                                   r2_score(y_prod, y_prod_predict)]
    df_prod_errors.to_csv(results_direct + model_name + '_prod_errors.csv')        
        
# end of tpot_regression()

# *******************************************************************
        
if __name__ == '__main__':
        
    calibration_fraction = 0.8
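    # (not used below: the calibration/production .npy files loaded here were split beforehand)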
    base_directory = 'YOUR DIRECTORY'  # set this to your own base path
    data_directory = base_directory + 'data/'
    
    results_directory = base_directory + 'results_tpot_calibration/'
    if not Path(results_directory).is_dir():
        os.mkdir(results_directory)
    
    # get data
    xcalib = np.load(data_directory + 'x_calib_fingerprints_1d.npy')
    ycalib = np.load(data_directory + 'y_calib.npy')
    
    xprod = np.load(data_directory + 'x_prod_fingerprints_1d.npy')
    yprod = np.load(data_directory + 'y_prod.npy')
        
    # tpot parameters                
    cvfolds = 3
    errormetric = 'neg_mean_squared_error'
    numjobs = -1
    generations = 1000
    population_size = 100
    max_time_mins = 8*60
    max_mins_per_pipeline = 10
    early_stop_gens = 10
    verbosity = 1
        
    print('\nstart time:', pd.Timestamp.now())
    
    num_features = xcalib.shape[1]
    step_size = 32
    
    tpot_configuration_dictionary = tpot_dictionary(num_features, step_size)

    tpot_regression(xcalib, ycalib, xprod, yprod, results_directory,
                    cvfolds, errormetric, numjobs, generations, population_size,
                    max_time_mins, max_mins_per_pipeline, verbosity, early_stop_gens,
                    tpot_configuration_dictionary)

    print('\nending at:', pd.Timestamp.now())

5. Production (Unseen Data) Results Code

import os
from pathlib import Path
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score

from sklearn.linear_model import LassoLarsCV
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.preprocessing import FunctionTransformer

from copy import copy

from tpot.builtins import StackingEstimator

import matplotlib.pyplot as plt  # imported for optional plotting; not used below

# *******************************************************************

# pipelines are the best from each generation + best at end

def tpot_pipelines(x_calib, y_calib, x_prod, y_prod, results_dir):
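    """Refit the best pipeline from each TPOT generation plus the overall best,
    score each on the calibration and production data, and form a median
    ensemble of the production predictions."""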
    
    df_calib_errors = pd.DataFrame(index=['Mean Squared Error','Median Absolute Error',
                                          'Correlation Coefficient','R2'])
    
    df_prod_errors = pd.DataFrame(index=['Mean Squared Error','Median Absolute Error',
                                          'Correlation Coefficient','R2'])
    
    df_prod_predicted_all = pd.DataFrame(data=y_prod, columns=['Actual'])
    
    name = 'Generation 1'  # Average CV score on the training set was: -0.6230351190025942
    exported_pipeline = SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01, 
                            kernel="rbf", shrinking=False, tol=1e-05)
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 2'  # Average CV score on the training set was: -0.621065613709162
    exported_pipeline = SVR(C=100.0, cache_size=1000, epsilon=0.1, gamma=0.01, 
                            kernel="rbf", shrinking=True, tol=0.1)
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 3'  # Average CV score on the training set was: -0.621255934447933
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01,
                                        gamma=0.01, kernel="rbf", shrinking=False, tol=0.1)),
        ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                            min_samples_leaf=3, min_samples_split=4, n_estimators=150))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 4'  # Average CV score on the training set was: -0.6207066714526902
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                                        kernel="rbf", shrinking=False, tol=0.1)),
        ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002, min_samples_leaf=6,
                            min_samples_split=4, n_estimators=150))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 5'  # Average CV score on the training set was: -0.6188447925587955
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                                        kernel="rbf", shrinking=False, tol=0.1)),
        StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.25,
                          min_samples_leaf=11, min_samples_split=2, n_estimators=100)),
        ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                            min_samples_leaf=6, min_samples_split=4, n_estimators=150))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 6'  # Average CV score on the training set was: -0.6218534825381317
    exported_pipeline = SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01, 
                            kernel="rbf", shrinking=False, tol=0.1)
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 7'  # Average CV score on the training set was: -0.6187763172835291
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                                        kernel="rbf", shrinking=False, tol=0.1)),
        StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.25,
                          min_samples_leaf=11, min_samples_split=2, n_estimators=100)),
        SelectFwe(score_func=f_regression, alpha=0.023),
        ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002, min_samples_leaf=6,
                            min_samples_split=4, n_estimators=150))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 8'  # Average CV score on the training set was: -0.5963075462519467
    exported_pipeline = make_pipeline(
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        ),
        StackingEstimator(estimator=LassoLarsCV(normalize=False)),
        SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01, kernel="rbf",
            shrinking=True, tol=0.001))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 9'  # Average CV score on the training set was: -0.6035834254008958
    exported_pipeline = make_pipeline(
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        ),
        SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01, kernel="rbf",
            shrinking=False, tol=0.0001))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Generation 10'  # Average CV score on the training set was: -0.6031886072873389
    exported_pipeline = make_pipeline(
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        ),
        SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01, kernel="rbf",
            shrinking=True, tol=0.001))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
    
    name = 'Best'  # Average CV score on the training set was: -0.5963075462519467
    exported_pipeline = make_pipeline(
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        ),
        StackingEstimator(estimator=LassoLarsCV(normalize=False)),
        SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01, kernel="rbf",
            shrinking=True, tol=0.001))
    exported_pipeline.fit(x_calib, y_calib)
    results = exported_pipeline.predict(x_prod)
    
    y_predicted_calib = exported_pipeline.predict(x_calib)
    
    df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                             median_absolute_error(y_calib, y_predicted_calib),
                             np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                             r2_score(y_calib, y_predicted_calib)]
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]
    
    df_prod_predicted_all[name] = results
       
    # ensemble
    name = 'Ensemble Median'
    results = np.median(df_prod_predicted_all.values, axis=1)
    df_prod_predicted_all[name] = results
    
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]

    # save
    df_calib_errors.to_pickle(results_dir + 'df_calib_errors.pkl')
    df_calib_errors.to_csv(results_dir + 'df_calib_errors.csv')
    
    df_prod_errors.to_pickle(results_dir + 'df_prod_errors.pkl')
    df_prod_errors.to_csv(results_dir + 'df_prod_errors.csv')
    
    df_prod_predicted_all.to_pickle(results_dir + 'df_prod_predicted_all.pkl')
    
# end of tpot_pipelines()
    
    
# ******************************************************************* 

if __name__ == '__main__':
    
    base_directory = 'C:/python/ml_projects_other/lipophilicity/'
    data_directory = base_directory + 'lipo_tpot_aws/'
    
    results_directory = base_directory + 'results/'
    if not Path(results_directory).is_dir():
        os.mkdir(results_directory)
    
    xcalib = np.load(data_directory + 'x_calib_fingerprints_1d.npy')
    ycalib = np.load(data_directory + 'y_calib.npy')
    
    xprod = np.load(data_directory + 'x_prod_fingerprints_1d.npy')
    yprod = np.load(data_directory + 'y_prod.npy')
    
    tpot_pipelines(xcalib, ycalib, xprod, yprod, results_directory)