Previous articles in this series:
1. Introduction
Using Morgan (circular) fingerprints as input data, we use TPOT (Tree-based Pipeline Optimization Tool) to create an ensemble of Scikit-learn pipelines for predicting lipophilicity values. See Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration for the link to the raw data as well as the code for converting SMILES strings to fingerprints via RDKit.
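As a reminder of the featurization, the sketch below shows the general pattern; it is a minimal illustration rather than the exact code from Part 1 (the helper name and the radius of 2 are our assumptions here, while the 2048-bit length matches the feature vectors used later in this article).

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smiles_to_morgan(smiles, radius=2, n_bits=2048):
    # hypothetical helper: parse the SMILES string and hash a circular (Morgan)
    # fingerprint of the assumed radius into a fixed-length binary vector
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# example: caffeine
print(smiles_to_morgan('CN1C=NC2=C1C(=O)N(C(=O)N2C)C').shape)   # (2048,)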
2. TPOT
TPOT is an automated machine learning library based on Scikit-learn and XGBoost that uses genetic programming (a genetic algorithm with a tree-like data structure) to create machine learning pipelines.
Below we show the TPOT API as well as the parameters that we used. The entire code is displayed at the end of this article.
ml_model = TPOTRegressor(generations=gens, population_size=pop, scoring=error_metric,
                         max_time_mins=mins, cv=cv_folds, verbosity=verbose,
                         n_jobs=num_jobs, early_stop=early_stop_generations,
                         max_eval_time_mins=mins_per_pipeline,
                         config_dict=tpot_config_dict,
                         periodic_checkpoint_folder=checkpoint_folder)
# tpot parameters
cvfolds = 3
errormetric = 'neg_mean_squared_error'
numjobs = -1
generations = 1000
population_size = 100
max_time_mins = 8*60
max_mins_per_pipeline = 10
early_stop_gens = 10
verbosity = 0
n_jobs=-1 specifies the use of all available CPUs (16 on the Amazon Web Services g3.4xlarge machine we used). We set generations to a large number because run time is instead controlled by a maximum execution time, max_time_mins, and by early stopping, early_stop (the number of generations without improvement in the scoring metric, here mean squared error). max_eval_time_mins caps the time spent evaluating any single pipeline, which prevents TPOT from getting stuck (this can sometimes happen with Scikit-learn's implementation of the support vector machine algorithm).
Specifying periodic_checkpoint_folder tells TPOT to write a pipeline from the current Pareto front to a Python file in that directory once per generation, but no more often than once every 30 seconds. We can then use these pipelines later to form an ensemble with the final TPOT result.
Setting config_dict allows us to specify which preprocessors, feature selectors, and machine learning algorithms TPOT may use, along with their allowable hyperparameters. The default dictionary can be found here. See the code at the end of this article for our custom dictionary.
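To illustrate the format, here is a hypothetical, stripped-down dictionary (much smaller than the one we actually used): keys are the full import paths of the operators TPOT may place in a pipeline, and values are the hyperparameter grids it may sample from.

import numpy as np

# hypothetical, stripped-down configuration: TPOT may only build pipelines
# from these operators, sampling hyperparameters from the listed values
small_config_dict = {
    'sklearn.svm.SVR': {
        'kernel': ['rbf'],
        'C': [0.1, 1.0, 100.0, 1.0e4],
        'gamma': [1.0e-4, 1.0e-2, 1.0],
        'epsilon': [1e-4, 1e-2, 1e-1]
    },
    'sklearn.feature_selection.SelectPercentile': {
        'percentile': np.arange(5, 101, 5, dtype=int),
        'score_func': {'sklearn.feature_selection.f_regression': None}
    }
}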
The final TPOT output is both a Python file and, if specified, a fitted Scikit-learn pipeline. The intermediate results written to the checkpoint folder are Python files; to use these results on the production data, we must cut and paste their pipelines into a Python file of our own.
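In practice the round trip looks roughly like the sketch below (the file names are the ones our calibration script writes; treat the paths as placeholders): the fitted final pipeline can be reloaded directly with joblib, whereas a checkpoint pipeline must be pasted from its exported .py file and refit on the calibration data.

import joblib
import numpy as np

# reload the final fitted pipeline saved by the calibration script below
best_pipeline = joblib.load('results_tpot_calibration/tpot_best.sav')

# predict on the held-out (production) fingerprints
x_prod = np.load('data/x_prod_fingerprints_1d.npy')
y_prod_predict = best_pipeline.predict(x_prod)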
The final (best) TPOT pipeline was:
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    StackingEstimator(estimator=LassoLarsCV(normalize=False)),
    SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01,
        kernel="rbf", shrinking=True, tol=0.001))
As each input data point is a binary vector of length 2048, it is not surprising that the support vector machine algorithm performed well.
The FunctionTransformer(copy) object allows for a basic form of stacking when an estimator appears in the middle of a pipeline: it passes through a copy of the entire dataset, which is then merged with the predictions that estimator makes on the same data. make_union applies a list of transformer objects in parallel to the input data, then concatenates the results; this is useful for combining several feature extraction mechanisms into a single transformer. StackingEstimator is a built-in TPOT utility that facilitates stacking by appending an estimator's predictions to its input features.
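To make the mechanics concrete, here is a small sketch on toy data (random binary features standing in for fingerprints, not our actual data) showing how these pieces change the shape of the feature matrix.

import numpy as np
from copy import copy
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LassoLarsCV
from tpot.builtins import StackingEstimator

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 16)).astype(float)    # toy binary features
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)      # toy target

# make_union with two copy transformers concatenates two copies of X
union = make_union(FunctionTransformer(copy), FunctionTransformer(copy))
print(union.fit_transform(X).shape)      # (100, 32)

# StackingEstimator prepends the fitted estimator's predictions as an extra column
stack = StackingEstimator(estimator=LassoLarsCV())
print(stack.fit_transform(X, y).shape)   # (100, 17)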
Generation 7 yielded the following interesting pipeline.
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                                    kernel="rbf", shrinking=False, tol=0.1)),
    StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.25,
                                                      min_samples_leaf=11, min_samples_split=2,
                                                      n_estimators=100)),
    SelectFwe(score_func=f_regression, alpha=0.023),
    ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                        min_samples_leaf=6, min_samples_split=4, n_estimators=150))
Here, we have the following data flow:
1. original input data to SVR yields 1st prediction
2. original input data and 1st prediction (numpy hstack) to RandomForestRegressor yields 2nd prediction
3. original input data and 1st prediction and 2nd prediction (numpy hstack) to SelectFwe to ExtraTreesRegressor to final output
SelectFwe is a univariate feature selector that keeps the features whose p-values pass a family-wise error rate threshold (alpha).
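The same flow can be written out by hand with numpy, as in the sketch below (toy stand-in data, abbreviated hyperparameters; note that TPOT's StackingEstimator prepends the prediction column to the features it received).

import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.feature_selection import SelectFwe, f_regression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 32)).astype(float)    # stand-in for the fingerprints
y = X[:, :4].sum(axis=1) + 0.1 * rng.normal(size=200)   # stand-in for lipophilicity

# 1. original input data to SVR yields the 1st prediction
svr = SVR(C=10000.0, epsilon=0.01, gamma=0.01, kernel="rbf").fit(X, y)
X1 = np.hstack((svr.predict(X).reshape(-1, 1), X))

# 2. 1st prediction + original data to RandomForestRegressor yields the 2nd prediction
rf = RandomForestRegressor(n_estimators=100, max_features=0.25, min_samples_leaf=11).fit(X1, y)
X2 = np.hstack((rf.predict(X1).reshape(-1, 1), X1))

# 3. both predictions + original data through SelectFwe, then the final ExtraTreesRegressor
X3 = SelectFwe(score_func=f_regression, alpha=0.023).fit_transform(X2, y)
final_model = ExtraTreesRegressor(n_estimators=150, max_features=0.95,
                                  min_samples_leaf=6, bootstrap=True).fit(X3, y)
print(X1.shape, X2.shape, X3.shape)   # (200, 33) (200, 34) (200, n_selected)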
3. Results
We present results for the best (final) TPOT pipeline as well as for an ensemble. The ensemble prediction is the median of the predictions from the best pipeline of each generation together with the overall best pipeline. All errors were computed on a held-out portion of the data set (labeled production in the code at the end of this article).
TPOT Lipophilicity Errors
| | Generation 1 | Generation 2 | Generation 3 | Generation 4 | Generation 5 | Generation 6 | Generation 7 | Generation 8 | Generation 9 | Generation 10 | Best | Ensemble Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean Squared Error | 0.6376 | 0.6260 | 0.6351 | 0.6357 | 0.6298 | 0.6343 | 0.6294 | 0.5882 | 0.5918 | 0.5917 | 0.5882 | 0.6188 |
| Median Absolute Error | 0.4204 | 0.4267 | 0.4202 | 0.4221 | 0.4197 | 0.4223 | 0.4206 | 0.4149 | 0.4313 | 0.4350 | 0.4149 | 0.4134 |
| Correlation Coefficient | 0.7674 | 0.7694 | 0.7687 | 0.7685 | 0.7707 | 0.7682 | 0.7711 | 0.7838 | 0.7803 | 0.7804 | 0.7838 | 0.7738 |
| R2 | 0.5769 | 0.5846 | 0.5786 | 0.5782 | 0.5821 | 0.5791 | 0.5824 | 0.6097 | 0.6073 | 0.6074 | 0.6097 | 0.5894 |


Below is a comparison of our results to those of the MoleculeNet benchmark from MoleculeNet: A Benchmark for Molecular Machine Learning (page 51).
Our Results Versus Benchmark Results
| Model | Root Mean Squared Error |
|---|---|
| Random Forest (RF) | 0.876 +/- 0.040 |
| Multitask (Multilayer Perceptron with multiple outputs) | 0.859 +/- 0.013 |
| XGBoost | 0.799 +/- 0.054 |
| Kernel Ridge Regression (KRR) | 0.899 +/- 0.043 |
| Graph Convolutional Network (GC) | 0.655 +/- 0.036 |
| Directed Acyclic Graph (DAG) | 0.835 +/- 0.039 |
| Weave | 0.715 +/- 0.035 |
| Message Passing Neural Network (MPNN) | 0.719 +/- 0.031 |
| TPOT Best | 0.767 |
| TPOT Ensemble | 0.787 |
| Multilayer Perceptron Ensemble | 0.823 |
| 1D CNN Fingerprints | 0.857 |
| 2D CNN Fingerprints | 0.844 |
Our root mean squared errors above are the square roots of the mean squared errors in the previous table (for example, sqrt(0.5882) ≈ 0.767 for the best TPOT pipeline). R2 results in bar chart form can be seen at the MoleculeNet site (see section Physical Chemistry -> lipo).
The benchmark results were obtained by running multiple random permutations of the dataset, which we did not do, so the comparisons with our results are not exact. However, it is interesting to note that stacking and ensembling with Morgan fingerprint features yields results that are competitive with the various graph convolutional models, which are generally expected to outperform other types of models.
4. Calibration Code
import pandas as pd
import numpy as np
import os
from pathlib import Path
import joblib
from sklearn.metrics import mean_squared_error      # 'neg_mean_squared_error'
from sklearn.metrics import median_absolute_error   # 'neg_median_absolute_error'
from sklearn.metrics import r2_score
from tpot import TPOTRegressor

# *******************************************************************

def tpot_dictionary(num_input, step):
    tpot_config_dict = {
        'sklearn.linear_model.ElasticNetCV': {
            'l1_ratio': np.arange(0.0, 1.01, 0.05),
            'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
        },
        'sklearn.ensemble.ExtraTreesRegressor': {
            'n_estimators': [50, 100, 150, 200, 250],
            'max_features': np.arange(0.05, 1.01, 0.1),
            'min_samples_split': range(2, 13),
            'min_samples_leaf': range(1, 13),
            'bootstrap': [True, False]
        },
        'sklearn.ensemble.GradientBoostingRegressor': {
            'n_estimators': [50, 100, 150, 200, 250],
            'loss': ["ls", "lad", "huber", "quantile"],
            'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
            'max_depth': range(1, 11),
            'min_samples_split': range(2, 13),
            'min_samples_leaf': range(1, 13),
            'subsample': np.arange(0.05, 1.01, 0.1),
            'max_features': np.arange(0.05, 1.01, 0.1),
            'alpha': [0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
        },
        'sklearn.ensemble.AdaBoostRegressor': {
            'n_estimators': [50, 100, 150, 200, 250],
            'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
            'loss': ["linear", "square", "exponential"]
        },
        'sklearn.neighbors.KNeighborsRegressor': {
            'n_neighbors': np.arange(1, 101, 2, dtype='int'),
            'weights': ["uniform", "distance"],
            'p': [2]
        },
        'sklearn.linear_model.LassoLarsCV': {
            'normalize': [True, False]
        },
        'sklearn.svm.LinearSVR': {
            'loss': ["epsilon_insensitive", "squared_epsilon_insensitive"],
            'dual': [True, False],
            'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
            'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
            'epsilon': [1e-4, 1e-3, 1e-2, 1e-1, 1.]
        },
        'sklearn.svm.SVR': {
            'kernel': ['rbf'],
            'cache_size': [1000],
            'shrinking': [True, False],
            'gamma': [5.0e-5, 1.0e-2, 1.0, 1.0e3],
            'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
            'C': [1.0e-3, 1.0e-1, 1.0, 100.0, 1.0e4],
            'epsilon': [1e-4, 1e-3, 1e-2, 1e-1, 1.]
        },
        'sklearn.ensemble.RandomForestRegressor': {
            'n_estimators': [50, 100, 150, 200, 250],
            'max_features': np.arange(0.05, 1.01, 0.2),
            'min_samples_split': range(2, 13),
            'min_samples_leaf': range(1, 13),
            'bootstrap': [True, False]
        },
        'sklearn.linear_model.RidgeCV': {
        },
        'xgboost.XGBRegressor': {
            'n_estimators': np.linspace(25, 250, num=10, endpoint=True, dtype='int'),
            'max_depth': range(1, 11),
            'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
            'subsample': np.arange(0.05, 1.01, 0.1),
            'min_child_weight': range(1, 13),
            'colsample_bylevel': [0.7, 0.85, 1.0],
            'nthread': [1],
            'objective': ['reg:squarederror']
        },
        # Preprocessors
        'sklearn.decomposition.FastICA': {
            'tol': np.arange(0.0, 1.01, 0.05)
        },
        # 'sklearn.preprocessing.MaxAbsScaler': {
        # },
        #
        # 'sklearn.preprocessing.MinMaxScaler': {
        # },
        #
        # 'sklearn.preprocessing.Normalizer': {
        #     'norm': ['l1', 'l2', 'max']
        # },
        'sklearn.decomposition.PCA': {
            'n_components': np.arange(num_input//2, num_input+1, step, dtype='int'),
            'svd_solver': ['randomized'],
            'iterated_power': range(1, 11)
        },
        # 'sklearn.preprocessing.PolynomialFeatures': {
        #     'degree': [2, 3],
        #     'include_bias': [False],
        #     'interaction_only': [False]
        # },
        #
        # 'sklearn.kernel_approximation.RBFSampler': {
        #     'gamma': np.arange(0.0, 1.01, 0.05)
        # },
        #
        # 'sklearn.preprocessing.RobustScaler': {
        # },
        #
        # 'sklearn.preprocessing.StandardScaler': {
        # },
        # Selectors
        'sklearn.feature_selection.SelectFwe': {
            'alpha': np.arange(0, 0.05, 0.001),
            'score_func': {
                'sklearn.feature_selection.f_regression': None
            }
        },
        'sklearn.feature_selection.SelectPercentile': {
            'percentile': np.arange(5, 101, 5, dtype=int),
            'score_func': {
                'sklearn.feature_selection.f_regression': None
            }
        },
        'sklearn.feature_selection.VarianceThreshold': {
            'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
        },
        'sklearn.feature_selection.SelectFromModel': {
            'threshold': np.arange(0, 1.01, 0.1),
            'estimator': {
                'sklearn.ensemble.ExtraTreesRegressor': {
                    'n_estimators': [100],
                    'max_features': np.arange(0.1, 1.01, 0.1)
                }
            }
        }
    }

    return tpot_config_dict
# end of tpot_dictionary()

# *******************************************************************

def tpot_regression(x_calib, y_calib, x_prod, y_prod, results_direct,
                    cv_folds, error_metric, num_jobs, gens, pop,
                    mins, mins_per_pipeline, verbose, early_stop_generations,
                    tpot_config_dict, model_name='tpot_best'):

    checkpoint_folder = results_direct + 'checkpoint_folder/'
    if not Path(checkpoint_folder).is_dir():
        os.mkdir(checkpoint_folder)

    ml_model = TPOTRegressor(generations=gens, population_size=pop, scoring=error_metric,
                             max_time_mins=mins, cv=cv_folds, verbosity=verbose,
                             n_jobs=num_jobs, early_stop=early_stop_generations,
                             max_eval_time_mins=mins_per_pipeline,
                             config_dict=tpot_config_dict,
                             periodic_checkpoint_folder=checkpoint_folder)

    ml_model.fit(x_calib, y_calib)

    # save entire pipeline
    ml_model.export(results_direct + model_name + '.py')
    joblib.dump(ml_model.fitted_pipeline_, results_direct + model_name + '.sav')

    # for cross validation errors see the exported model py file

    # production - results and errors
    y_prod_predict = ml_model.predict(x_prod)
    np.save(results_direct + model_name + '_prod_predicted.npy', y_prod_predict)

    df_prod_errors = pd.DataFrame(index=['Mean Squared Error', 'Median Absolute Error',
                                         'Correlation Coefficient', 'R2'])
    df_prod_errors['TPOT Best'] = [mean_squared_error(y_prod, y_prod_predict),
                                   median_absolute_error(y_prod, y_prod_predict),
                                   np.corrcoef(y_prod, y_prod_predict)[0][-1],
                                   r2_score(y_prod, y_prod_predict)]
    df_prod_errors.to_csv(results_direct + model_name + '_prod_errors.csv')
# end of tpot_regression()

# *******************************************************************

if __name__ == '__main__':

    calibration_fraction = 0.8

    base_directory = 'YOUR DIRECTORY/'   # replace with your own base directory
    data_directory = base_directory + 'data/'
    results_directory = base_directory + 'results_tpot_calibration/'
    if not Path(results_directory).is_dir():
        os.mkdir(results_directory)

    # get data
    xcalib = np.load(data_directory + 'x_calib_fingerprints_1d.npy')
    ycalib = np.load(data_directory + 'y_calib.npy')
    xprod = np.load(data_directory + 'x_prod_fingerprints_1d.npy')
    yprod = np.load(data_directory + 'y_prod.npy')

    # tpot parameters
    cvfolds = 3
    errormetric = 'neg_mean_squared_error'
    numjobs = -1
    generations = 1000
    population_size = 100
    max_time_mins = 8*60
    max_mins_per_pipeline = 10
    early_stop_gens = 10
    verbosity = 1

    print('\nstart time:', pd.Timestamp.now())

    num_features = xcalib.shape[1]
    step_size = 32
    tpot_configuration_dictionary = tpot_dictionary(num_features, step_size)

    tpot_regression(xcalib, ycalib, xprod, yprod, results_directory,
                    cvfolds, errormetric, numjobs, generations, population_size,
                    max_time_mins, max_mins_per_pipeline, verbosity, early_stop_gens,
                    tpot_configuration_dictionary)

    print('\nending at:', pd.Timestamp.now())
5. Production (Unseen Data) Results Code
import os
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LassoLarsCV
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.preprocessing import FunctionTransformer
from copy import copy
from tpot.builtins import StackingEstimator
import matplotlib.pyplot as plt

# *******************************************************************

# pipelines are the best from each generation + best at end
def tpot_pipelines(x_calib, y_calib, x_prod, y_prod, results_dir):

    error_names = ['Mean Squared Error', 'Median Absolute Error',
                   'Correlation Coefficient', 'R2']
    df_calib_errors = pd.DataFrame(index=error_names)
    df_prod_errors = pd.DataFrame(index=error_names)
    df_prod_predicted_all = pd.DataFrame(data=y_prod, columns=['Actual'])

    # pipelines cut and pasted from the TPOT checkpoint files (best of each
    # generation) plus the final best pipeline; the comments give the average
    # cross-validation score on the training set reported by TPOT
    pipelines = {
        # -0.6230351190025942
        'Generation 1': SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01,
                            kernel="rbf", shrinking=False, tol=1e-05),
        # -0.621065613709162
        'Generation 2': SVR(C=100.0, cache_size=1000, epsilon=0.1, gamma=0.01,
                            kernel="rbf", shrinking=True, tol=0.1),
        # -0.621255934447933
        'Generation 3': make_pipeline(
            StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01,
                                            gamma=0.01, kernel="rbf", shrinking=False, tol=0.1)),
            ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                                min_samples_leaf=3, min_samples_split=4, n_estimators=150)),
        # -0.6207066714526902
        'Generation 4': make_pipeline(
            StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01,
                                            gamma=0.01, kernel="rbf", shrinking=False, tol=0.1)),
            ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                                min_samples_leaf=6, min_samples_split=4, n_estimators=150)),
        # -0.6188447925587955
        'Generation 5': make_pipeline(
            StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01,
                                            gamma=0.01, kernel="rbf", shrinking=False, tol=0.1)),
            StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.25,
                                                              min_samples_leaf=11, min_samples_split=2,
                                                              n_estimators=100)),
            ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                                min_samples_leaf=6, min_samples_split=4, n_estimators=150)),
        # -0.6218534825381317
        'Generation 6': SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                            kernel="rbf", shrinking=False, tol=0.1),
        # -0.6187763172835291
        'Generation 7': make_pipeline(
            StackingEstimator(estimator=SVR(C=10000.0, cache_size=1000, epsilon=0.01,
                                            gamma=0.01, kernel="rbf", shrinking=False, tol=0.1)),
            StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.25,
                                                              min_samples_leaf=11, min_samples_split=2,
                                                              n_estimators=100)),
            SelectFwe(score_func=f_regression, alpha=0.023),
            ExtraTreesRegressor(bootstrap=True, max_features=0.9500000000000002,
                                min_samples_leaf=6, min_samples_split=4, n_estimators=150)),
        # -0.5963075462519467
        'Generation 8': make_pipeline(
            make_union(
                FunctionTransformer(copy),
                FunctionTransformer(copy)
            ),
            StackingEstimator(estimator=LassoLarsCV(normalize=False)),
            SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01,
                kernel="rbf", shrinking=True, tol=0.001)),
        # -0.6035834254008958
        'Generation 9': make_pipeline(
            make_union(
                FunctionTransformer(copy),
                FunctionTransformer(copy)
            ),
            SVR(C=10000.0, cache_size=1000, epsilon=0.01, gamma=0.01,
                kernel="rbf", shrinking=False, tol=0.0001)),
        # -0.6031886072873389
        'Generation 10': make_pipeline(
            make_union(
                FunctionTransformer(copy),
                FunctionTransformer(copy)
            ),
            SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01,
                kernel="rbf", shrinking=True, tol=0.001)),
        # -0.5963075462519467 (identical to Generation 8)
        'Best': make_pipeline(
            make_union(
                FunctionTransformer(copy),
                FunctionTransformer(copy)
            ),
            StackingEstimator(estimator=LassoLarsCV(normalize=False)),
            SVR(C=10000.0, cache_size=1000, epsilon=0.0001, gamma=0.01,
                kernel="rbf", shrinking=True, tol=0.001)),
    }

    # fit each pipeline on the calibration data, then record calibration and
    # production (unseen data) errors and store the production predictions
    for name, exported_pipeline in pipelines.items():
        exported_pipeline.fit(x_calib, y_calib)
        y_predicted_calib = exported_pipeline.predict(x_calib)
        results = exported_pipeline.predict(x_prod)
        df_calib_errors[name] = [mean_squared_error(y_calib, y_predicted_calib),
                                 median_absolute_error(y_calib, y_predicted_calib),
                                 np.corrcoef(y_calib, y_predicted_calib)[0][-1],
                                 r2_score(y_calib, y_predicted_calib)]
        df_prod_errors[name] = [mean_squared_error(y_prod, results),
                                median_absolute_error(y_prod, results),
                                np.corrcoef(y_prod, results)[0][-1],
                                r2_score(y_prod, results)]
        df_prod_predicted_all[name] = results

    # ensemble: element-wise median across the stored columns
    name = 'Ensemble Median'
    results = np.median(df_prod_predicted_all.values, axis=1)
    df_prod_predicted_all[name] = results
    df_prod_errors[name] = [mean_squared_error(y_prod, results),
                            median_absolute_error(y_prod, results),
                            np.corrcoef(y_prod, results)[0][-1],
                            r2_score(y_prod, results)]

    # save
    df_calib_errors.to_pickle(results_dir + 'df_calib_errors.pkl')
    df_calib_errors.to_csv(results_dir + 'df_calib_errors.csv')
    df_prod_errors.to_pickle(results_dir + 'df_prod_errors.pkl')
    df_prod_errors.to_csv(results_dir + 'df_prod_errors.csv')
    df_prod_predicted_all.to_pickle(results_dir + 'df_prod_predicted_all.pkl')
# end of tpot_pipelines()

# *******************************************************************

if __name__ == '__main__':

    base_directory = 'C:/python/ml_projects_other/lipophilicity/'
    data_directory = base_directory + 'lipo_tpot_aws/'
    results_directory = base_directory + 'results/'
    if not Path(results_directory).is_dir():
        os.mkdir(results_directory)

    xcalib = np.load(data_directory + 'x_calib_fingerprints_1d.npy')
    ycalib = np.load(data_directory + 'y_calib.npy')
    xprod = np.load(data_directory + 'x_prod_fingerprints_1d.npy')
    yprod = np.load(data_directory + 'y_prod.npy')

    tpot_pipelines(xcalib, ycalib, xprod, yprod, results_directory)