Previous articles in this series:
- Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration
- Prediction of Small Molecule Lipophilicity: Part 2 – TPOT (Tree-based Pipeline Optimization Tool) With Morgan (Circular) Fingerprints
- Introduction
- Neural Network Parameters
- Results
- Calibration Code
- Production (Unseen Data) Results Code
1. Introduction
Using Morgan (circular) fingerprints as input data, we create an ensemble of multilayer perceptrons with Keras, to predict lipophilicity values. See Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration for the link to the raw data as well as code for converting SMILES strings to fingerprints via RDKit.
2. Neural Network Parameters
We use randomized parameter search for the parameters of the multilayer perceptrons.
import numpy as np from sklearn.model_selection import ParameterSampler # ******************************************************************* def parameters_mlp(num_samples): batch_size = [32, 64] dropout = np.linspace(0.0, 0.5, num=11, endpoint=True) activation = ['relu','elu'] l1_regularizer = [0.0, 0.0001, 0.001] l2_regularizer = [0.0, 0.0001, 0.001] layers = [3, 4, 5, 6] grid = {'batch_size':batch_size, 'dropout':dropout, 'activation':activation, 'l1_regularizer':l1_regularizer, 'l2_regularizer':l2_regularizer, 'layers':layers} parameter_grid = list(ParameterSampler(grid, n_iter=num_samples)) return parameter_grid
To create an ensemble of neural networks, we set a cross validation error threshold and keep 10 models with cross validation errors below the threshold. These are the parameters of those models.
MLP Hyperparameters
layers | l2_regularizer | l1_regularizer | dropout | batch_size | activation | Model Name | Metric | Mean CV Error | Max CV Epochs |
---|---|---|---|---|---|---|---|---|---|
3 | 0.0001 | 0 | 0.25 | 64 | relu | mlp_fingerprints_3 | mean_squared_error | 0.002240 | 434 |
5 | 0.0001 | 0 | 0.1 | 32 | relu | mlp_fingerprints_47 | mean_squared_error | 0.002254 | 210 |
4 | 0 | 0 | 0.05 | 32 | relu | mlp_fingerprints_46 | mean_squared_error | 0.002303 | 33 |
6 | 0 | 0 | 0.05 | 64 | relu | mlp_fingerprints_61 | mean_squared_error | 0.002396 | 32 |
3 | 0.001 | 0 | 0.1 | 64 | relu | mlp_fingerprints_83 | mean_squared_error | 0.002563 | 173 |
4 | 0.0001 | 0.0001 | 0 | 64 | elu | mlp_fingerprints_63 | mean_squared_error | 0.003375 | 116 |
4 | 0.0001 | 0 | 0.3 | 64 | elu | mlp_fingerprints_74 | mean_squared_error | 0.003416 | 203 |
3 | 0.0001 | 0.0001 | 0.05 | 32 | elu | mlp_fingerprints_56 | mean_squared_error | 0.003460 | 103 |
3 | 0.001 | 0 | 0.3 | 64 | elu | mlp_fingerprints_20 | mean_squared_error | 0.003519 | 103 |
4 | 0.0001 | 0 | 0.3 | 32 | elu | mlp_fingerprints_84 | mean_squared_error | 0.003595 | 129 |
3. Results
Results were produced using unseen data.
Lipophilicity Errors
Mean Squared Error | Root Mean Squared Error | Median Absolute Error | Correlation Coefficient | R2 | |
---|---|---|---|---|---|
mlp_fingerprints_3 | 0.6162 | 0.785 | 0.4432 | 0.7706 | 0.5911 |
mlp_fingerprints_47 | 0.6775 | 0.8231 | 0.4752 | 0.7486 | 0.5505 |
mlp_fingerprints_46 | 0.6653 | 0.8156 | 0.4556 | 0.7493 | 0.5586 |
mlp_fingerprints_61 | 0.6949 | 0.8336 | 0.4965 | 0.7434 | 0.5389 |
mlp_fingerprints_83 | 0.6306 | 0.7941 | 0.4416 | 0.763 | 0.5816 |
mlp_fingerprints_63 | 0.9106 | 0.9543 | 0.5638 | 0.6624 | 0.3958 |
mlp_fingerprints_74 | 1.1835 | 1.0879 | 0.5775 | 0.6301 | 0.2148 |
mlp_fingerprints_56 | 0.8448 | 0.9192 | 0.5608 | 0.6802 | 0.4394 |
mlp_fingerprints_20 | 0.8633 | 0.9291 | 0.5339 | 0.6842 | 0.4272 |
mlp_fingerprints_84 | 1.114 | 1.0554 | 0.5923 | 0.6425 | 0.2609 |
Ensemble Median | 0.6766 | 0.8226 | 0.4674 | 0.7457 | 0.5511 |
Below is a comparison of our results to those of the Molecule Net benchmark from MoleculeNet: A Benchmark for Molecular Machine Learning (page 51).
Our Results Versus Benchmark Results
Model | Root Mean Squared Error |
---|---|
Random Forest (RF) | 0.876 +/- 0.040 |
Multitask (Multilayer Perceptron with multiple outputs) | 0.859 +/- 0.013 |
XGBoost | 0.799 +/- 0.054 |
Kernel Ridge Regression (KRR) | 0.899 +/- 0.043 |
Graph Convolutional Network (GC) | 0.655 +/- 0.036 |
Directed Acyclic Graph (DAG) | 0.835 +/- 0.039 |
Weave | 0.715 +/- 0.035 |
Message Passing Neural Network (MPNN) | 0.719 +/- 0.031 |
TPOT Best | 0.767 |
TPOT Ensemble | 0.787 |
Multilayer Perceptron Ensemble | 0.823 |
1D CNN Fingerprints | 0.857 |
2D CNN Fingerprints | 0.844 |
R2 results in barchart form can be seen at the Molecule Net site (see section Physical Chemistry -> lipo).
The benchmark results were obtained by running multiple random permutations of the dataset, which we did not do, so the comparisons with our results are not exact.
4. Calibration Code
Model construction function:
from keras.layers import Dense, Input, Dropout from keras.models import Model from keras.optimizers import Adam from keras import regularizers def mlp_model_construction(dict_params_mlp, num_output_nodes, input_length, loss_type, metric): l1_reg = dict_params_mlp['l1_regularizer'] l2_reg = dict_params_mlp['l2_regularizer'] input_tensor = Input(shape=(input_length,)) # layer 1 number_of_nodes = input_length//2 x = Dense(number_of_nodes, activation=dict_params_mlp['activation'], kernel_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg), activity_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg))(input_tensor) x = Dropout(dict_params_mlp['dropout'])(x) # layers for layer in range(dict_params_mlp['layers'] - 1): number_of_nodes = number_of_nodes//2 x = Dense(number_of_nodes, activation=dict_params_mlp['activation'], kernel_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg), activity_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg))(x) x = Dropout(dict_params_mlp['dropout'])(x) output_tensor = Dense(num_output_nodes)(x) model_mlp_f = Model(input_tensor, output_tensor) # compile the model opt = Adam(lr=0.00025) # default = 0.001 model_mlp_f.compile(optimizer=opt, loss=loss_type, metrics=[metric]) return model_mlp_f
Cross validation, thresholding, and saving good models function:
import numpy as np from keras.callbacks import EarlyStopping, ReduceLROnPlateau def nn_cv(x_calib, y_calib, cv_folds, metric, nn_model, nn_model_copy, batchsize, num_epochs, stop_epochs, verbosity, models_dir, model_name, threshold, learn_rate_epochs): callbacks_list = [EarlyStopping(monitor='val_loss', patience=stop_epochs), ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=learn_rate_epochs, verbose=0, mode='auto', min_lr=1.0e-6)] cv_num = int(x_calib.shape[0]/cv_folds) list_cv_metrics = [] list_cv_epochs = [] for fold in range(cv_folds): # get train/valid x_train = np.vstack((x_calib[:cv_num*fold], x_calib[cv_num*(fold+1):])) y_train = np.hstack((y_calib[:cv_num*fold], y_calib[cv_num*(fold+1):])) x_valid = x_calib[cv_num*fold:cv_num*(fold+1)] y_valid = y_calib[cv_num*fold:cv_num*(fold+1)] # fit the model h = nn_model.fit(x_train, y_train, epochs=num_epochs, batch_size=batchsize, validation_data=(x_valid, y_valid), callbacks=callbacks_list, verbose=verbosity) # collect cv stats list_cv_metrics.append(np.min(h.history['val_' + metric])) epoch_best_early_stop = len(h.history['val_loss']) - stop_epochs list_cv_epochs.append(epoch_best_early_stop) dict_return = {} mean_cv_error = np.mean(list_cv_metrics) max_cv_epochs = np.max(list_cv_epochs) print('mean_cv_error =',mean_cv_error,'\tthreshold =',threshold) print('max_cv_epochs =',max_cv_epochs) if mean_cv_error < threshold: # fit the model (use copy) with all calib data h_refit = nn_model_copy.fit(x_calib, y_calib, epochs=max_cv_epochs, batch_size=batchsize, verbose=0) nn_model_copy.save(models_dir + model_name + '.h5') dict_return['Model Name'] = model_name dict_return['Metric'] = metric dict_return['Mean CV Error'] = mean_cv_error dict_return['Max CV Epochs'] = max_cv_epochs dict_return['Batch Size'] = batchsize return dict_return
In the main code below, we note that we divided the lipophilicity values by 10 so that their range is [-0.15, 0.45] and thus no output activation function is used.
if __name__ == '__main__': calculation_type = 'production' # production calibration nn_type = 'mlp' target_divisor = 10.0 # target E [-0.15, 0.45] so no need of activation function base_directory = YOUR DIRECTORY data_directory = base_directory + 'data/' results_directory_stub = base_directory + nn_type + '_fingerprints/' if not Path(results_directory_stub).is_dir(): os.mkdir(results_directory_stub) results_directory = results_directory_stub + 'results_' + nn_type + '_fingerprints/' if not Path(results_directory).is_dir(): os.mkdir(results_directory) models_directory = results_directory_stub + 'models_' + nn_type + '_fingerprints/' if not Path(models_directory).is_dir(): os.mkdir(models_directory) metric_type = 'mean_squared_error' type_of_loss = metric_type verbose = 0 #2 epochs = 500 #200 checkpoint_epochs_stop = 15 learning_rate_epochs = 10 cross_valid_folds = 3 size_of_parameter_list = 1000 error_threshold = 0.015 max_time_minutes = 6*60 number_of_models = 10 if calculation_type == 'calibration': print('\n\n*** starting calibration at',pd.Timestamp.now()) start_time_calibration = time.time() # get data ycalib = np.load(data_directory + 'y_calib.npy')/target_divisor xcalib = np.load(data_directory + 'x_calib_fingerprints_1d.npy') length_of_input = xcalib[0].shape[0] # length_of_fingerprint_vector = 2048 try: number_of_outputs = ycalib.shape[1] except: number_of_outputs = 1 # ycalib.shape[1] # set up parameters (stochastic) list_param_dict = parameters_mlp(size_of_parameter_list) list_dict_results = [] counter = 0 # loop over parameters for dictionary_parameter in list_param_dict: modelname = nn_type + '_fingerprints_' + str(counter) counter = counter + 1 print('\n\n*** starting at',pd.Timestamp.now()) start_time_loop = time.time() model_nn = mlp_model_construction(dictionary_parameter, number_of_outputs, length_of_input, type_of_loss, metric_type) print(dictionary_parameter) dict_results = nn_cv(xcalib, ycalib, cross_valid_folds, metric_type, model_nn, model_nn, dictionary_parameter['batch_size'], epochs, checkpoint_epochs_stop, verbose, models_directory, modelname, error_threshold, learning_rate_epochs) elapsed_time_minutes = (time.time() - start_time_loop)/60 print('*** elapsed time =',elapsed_time_minutes,' mins\tcounter =',counter-1) # add epochs and modelname to param dict and save it if len(dict_results) > 0: dictionary_parameter.update(dict_results) list_dict_results.append(dictionary_parameter) # check elapsed time for early stopping elapsed_time_minutes = (time.time() - start_time_calibration)/60 if elapsed_time_minutes > max_time_minutes: break print('\n\n*** ending calibation at',pd.Timestamp.now()) print('elapsed time =',(time.time()-start_time_calibration)/60,' min') # collect results in a df, then save df_param_results = pd.DataFrame(data=list_dict_results) df_param_results.to_pickle(models_directory + 'df_parameter_results.pkl') df_param_results.to_csv(models_directory + 'df_parameter_results.csv') list_dict_results.clear() elif calculation_type == 'production': # get data yprod = np.load(data_directory + 'y_prod.npy') # UNALTERED (SEE CREATE_ENSEMBLE) xprod = np.load(data_directory + 'x_prod_fingerprints_1d.npy') length_of_input = xprod[0].shape[0] # length_of_fingerprint_vector = 2048 df_results = pd.read_pickle(models_directory + 'df_parameter_results.pkl') neural_network_ensemble.create_ensemble('regression', 'Mean CV Error', 'Keras', df_results, number_of_models, models_directory, results_directory, xprod, yprod, target_divisor) else: raise NameError
5. Production (Unseen Data) Results Code
We sort the cross validation results then read in the 10 best models. The final result is the median lipophilicity prediction of these models.
import pandas as pd import numpy as np from sklearn.metrics import mean_squared_error from sklearn.metrics import median_absolute_error from sklearn.metrics import r2_score # ******************************************************************* # y_prod MUST BE THE ORINGINAL UNALTERED VALUES # y_pred_multiplier = divisor in the calibration code def create_ensemble(problem_type, sort_column_name, nn_library_name, dfresults, num_models, models_dir, results_dir, x_prod, y_prod, y_pred_multiplier=1.0): assert(dfresults.shape[0] > 0) num_models = min(num_models, dfresults.shape[0]) if nn_library_name == 'Keras': from keras.models import load_model as load_model_nn else: print('\ninvalid nn library name:',nn_library_name) raise NameError if problem_type == 'regression': dfr = dfresults.sort_values(by=sort_column_name, ascending=True, inplace=False) dfr = dfr[:num_models] df_production_errors = pd.DataFrame(index=['Mean Squared Error', 'Median Absolute Error', 'Correlation Coefficient','R2']) list_predictions = [] for i in range(num_models): model_name = dfr.loc[dfr.index[i],'Model Name'] model_final = load_model_nn(models_dir + model_name + '.h5') y_prod_predict = model_final.predict(x_prod)*y_pred_multiplier list_predictions.append(y_prod_predict) y_prod_predict_temp = y_prod_predict.reshape(y_prod_predict.shape[0],) df_production_errors[model_name] = [mean_squared_error(y_prod, y_prod_predict), median_absolute_error(y_prod, y_prod_predict), np.corrcoef(y_prod, y_prod_predict_temp)[0][1], r2_score(y_prod, y_prod_predict)] # compute ensemble array_predictions = np.hstack((list_predictions)) median_prediction = np.median(array_predictions, axis=1) df_production_errors['Ensemble Median'] = [mean_squared_error(y_prod, median_prediction), median_absolute_error(y_prod, median_prediction), np.corrcoef(y_prod, median_prediction)[0][-1], r2_score(y_prod, median_prediction)] df_production_results = pd.DataFrame(data=y_prod, columns=['Actual']) df_production_results['Ensemble Predicted'] = median_prediction elif problem_type == 'classification': print('\nno coding yet for classification') raise NameError else: raise NameError dfr.to_pickle(results_dir + 'df_parameters_ensemble.pkl') dfr.to_csv(results_dir + 'df_parameters_ensemble.csv') df_production_results.to_pickle(results_dir + 'df_production_results.pkl') df_production_results.to_csv(results_dir + 'df_production_results.csv') df_production_errors.to_pickle(results_dir + 'df_production_errors.pkl') df_production_errors.to_csv(results_dir + 'df_production_errors.csv')