Sections in this article:

  1. Introduction
  2. Neural Network Parameters
  3. Results
  4. Calibration Code
  5. Production (Unseen Data) Results Code

1. Introduction

Using Morgan (circular) fingerprints as input data, we create an ensemble of multilayer perceptrons with Keras to predict lipophilicity values. See Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration for the link to the raw data as well as the code for converting SMILES strings to fingerprints via RDKit.
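As a quick reminder of the featurization, below is a minimal sketch of the SMILES-to-fingerprint conversion. The 2048-bit length matches the fingerprint vectors used in the code later in this article; the radius of 2 is an assumed value here, and Part 1 has the exact code used.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_fingerprint(smiles, n_bits=2048, radius=2):
    # returns a 1d numpy array of 0/1 bits, or None for an unparseable SMILES
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

print(smiles_to_fingerprint('CCO')[:8])  # ethanol, first 8 bits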

2. Neural Network Parameters

We use randomized parameter search, via scikit-learn's ParameterSampler, to draw hyperparameter combinations for the multilayer perceptrons.

import numpy as np
from sklearn.model_selection import ParameterSampler

# *******************************************************************

def parameters_mlp(num_samples):
    # candidate values for each hyperparameter
    batch_size = [32, 64]
    dropout = np.linspace(0.0, 0.5, num=11, endpoint=True)
    activation = ['relu', 'elu']
    l1_regularizer = [0.0, 0.0001, 0.001]
    l2_regularizer = [0.0, 0.0001, 0.001]
    layers = [3, 4, 5, 6]

    grid = {'batch_size':batch_size, 'dropout':dropout,
            'activation':activation,
            'l1_regularizer':l1_regularizer,
            'l2_regularizer':l2_regularizer,
            'layers':layers}

    # draw num_samples random combinations from the grid
    parameter_grid = list(ParameterSampler(grid, n_iter=num_samples))

    return parameter_grid
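For example, drawing three random combinations (a usage sketch; the draws differ from run to run):

for params in parameters_mlp(3):
    print(params)
# each dict holds one value per hyperparameter, e.g.
# {'layers': 4, 'l2_regularizer': 0.0, 'l1_regularizer': 0.0001,
#  'dropout': 0.2, 'batch_size': 64, 'activation': 'relu'}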

To create an ensemble of neural networks, we set a cross validation error threshold and keep 10 models whose cross validation errors fall below it. Their parameters are listed below.

MLP Hyperparameters

Layers | L2 Regularizer | L1 Regularizer | Dropout | Batch Size | Activation | Model Name | Metric | Mean CV Error | Max CV Epochs
3 | 0.0001 | 0 | 0.25 | 64 | relu | mlp_fingerprints_3 | mean_squared_error | 0.002240 | 434
5 | 0.0001 | 0 | 0.1 | 32 | relu | mlp_fingerprints_47 | mean_squared_error | 0.002254 | 210
4 | 0 | 0 | 0.05 | 32 | relu | mlp_fingerprints_46 | mean_squared_error | 0.002303 | 33
6 | 0 | 0 | 0.05 | 64 | relu | mlp_fingerprints_61 | mean_squared_error | 0.002396 | 32
3 | 0.001 | 0 | 0.1 | 64 | relu | mlp_fingerprints_83 | mean_squared_error | 0.002563 | 173
4 | 0.0001 | 0.0001 | 0 | 64 | elu | mlp_fingerprints_63 | mean_squared_error | 0.003375 | 116
4 | 0.0001 | 0 | 0.3 | 64 | elu | mlp_fingerprints_74 | mean_squared_error | 0.003416 | 203
3 | 0.0001 | 0.0001 | 0.05 | 32 | elu | mlp_fingerprints_56 | mean_squared_error | 0.003460 | 103
3 | 0.001 | 0 | 0.3 | 64 | elu | mlp_fingerprints_20 | mean_squared_error | 0.003519 | 103
4 | 0.0001 | 0 | 0.3 | 32 | elu | mlp_fingerprints_84 | mean_squared_error | 0.003595 | 129

3. Results

Results were produced using unseen data.

Lipophilicity Errors

Model | Mean Squared Error | Root Mean Squared Error | Median Absolute Error | Correlation Coefficient | R2
mlp_fingerprints_3 | 0.6162 | 0.785 | 0.4432 | 0.7706 | 0.5911
mlp_fingerprints_47 | 0.6775 | 0.8231 | 0.4752 | 0.7486 | 0.5505
mlp_fingerprints_46 | 0.6653 | 0.8156 | 0.4556 | 0.7493 | 0.5586
mlp_fingerprints_61 | 0.6949 | 0.8336 | 0.4965 | 0.7434 | 0.5389
mlp_fingerprints_83 | 0.6306 | 0.7941 | 0.4416 | 0.763 | 0.5816
mlp_fingerprints_63 | 0.9106 | 0.9543 | 0.5638 | 0.6624 | 0.3958
mlp_fingerprints_74 | 1.1835 | 1.0879 | 0.5775 | 0.6301 | 0.2148
mlp_fingerprints_56 | 0.8448 | 0.9192 | 0.5608 | 0.6802 | 0.4394
mlp_fingerprints_20 | 0.8633 | 0.9291 | 0.5339 | 0.6842 | 0.4272
mlp_fingerprints_84 | 1.114 | 1.0554 | 0.5923 | 0.6425 | 0.2609
Ensemble Median | 0.6766 | 0.8226 | 0.4674 | 0.7457 | 0.5511


Below is a comparison of our results to those of the MoleculeNet benchmark from MoleculeNet: A Benchmark for Molecular Machine Learning (page 51).

Our Results Versus Benchmark Results

Model | Root Mean Squared Error
Random Forest (RF) | 0.876 +/- 0.040
Multitask (Multilayer Perceptron with multiple outputs) | 0.859 +/- 0.013
XGBoost | 0.799 +/- 0.054
Kernel Ridge Regression (KRR) | 0.899 +/- 0.043
Graph Convolutional Network (GC) | 0.655 +/- 0.036
Directed Acyclic Graph (DAG) | 0.835 +/- 0.039
Weave | 0.715 +/- 0.035
Message Passing Neural Network (MPNN) | 0.719 +/- 0.031
TPOT Best | 0.767
TPOT Ensemble | 0.787
Multilayer Perceptron Ensemble | 0.823
1D CNN Fingerprints | 0.857
2D CNN Fingerprints | 0.844

R2 results in bar chart form can be seen at the MoleculeNet site (see section Physical Chemistry -> lipo).

The benchmark results were obtained by averaging over multiple random permutations of the dataset, which we did not do, so the comparisons with our results are not exact.
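For reference, a minimal sketch of that protocol: repeat a random train/test split several times and report the mean and standard deviation of the RMSE. The split fraction, repeat count, and the fit_predict helper below are assumptions for illustration, not the benchmark's exact settings.

import numpy as np
from sklearn.model_selection import train_test_split

def rmse_over_random_splits(x, y, fit_predict, n_repeats=3, test_fraction=0.2):
    # fit_predict(x_train, y_train, x_test) -> 1d array of predictions
    # (hypothetical helper wrapping model training and prediction)
    rmse_list = []
    for seed in range(n_repeats):
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=test_fraction, random_state=seed)
        y_pred = fit_predict(x_train, y_train, x_test)
        rmse_list.append(np.sqrt(np.mean((y_test - y_pred)**2)))
    return np.mean(rmse_list), np.std(rmse_list)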

4. Calibration Code

Model construction function:

from keras.layers import Dense, Input, Dropout
from keras.models import Model
from keras.optimizers import Adam
from keras import regularizers

def mlp_model_construction(dict_params_mlp, num_output_nodes,
                           input_length, loss_type, metric):
    
    l1_reg = dict_params_mlp['l1_regularizer']
    l2_reg = dict_params_mlp['l2_regularizer']
    
    input_tensor = Input(shape=(input_length,))
       
    # layer 1: width is half the input length (1024 for 2048-bit fingerprints)
    number_of_nodes = input_length//2
    
    x = Dense(number_of_nodes, activation=dict_params_mlp['activation'],
              kernel_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg),
              activity_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg))(input_tensor)
    
    x = Dropout(dict_params_mlp['dropout'])(x)
        
    # remaining hidden layers, halving the width at each layer
    for layer in range(dict_params_mlp['layers'] - 1):
        number_of_nodes = number_of_nodes//2
        
        x = Dense(number_of_nodes, activation=dict_params_mlp['activation'],
                  kernel_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg),
                  activity_regularizer=regularizers.l1_l2(l1=l1_reg, l2=l2_reg))(x)
        
        x = Dropout(dict_params_mlp['dropout'])(x)

    output_tensor = Dense(num_output_nodes)(x)  # linear output: no activation needed for the scaled targets
       
    model_mlp_f = Model(input_tensor, output_tensor)
     
    # compile the model
    opt = Adam(lr=0.00025)  # default = 0.001
    model_mlp_f.compile(optimizer=opt, loss=loss_type, metrics=[metric])
     
    return model_mlp_f 
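Since the widths halve at each step starting from input_length//2, a model with 2048-bit fingerprints and layers=4 has hidden layers of 1024, 512, 256, and 128 nodes. A usage sketch, building one model from a randomly sampled parameter set:

params = parameters_mlp(1)[0]
model = mlp_model_construction(params, num_output_nodes=1,
                               input_length=2048,
                               loss_type='mean_squared_error',
                               metric='mean_squared_error')
model.summary()  # shows the halving layer widths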

Cross validation, thresholding, and model saving function. A model whose mean cross validation error beats the threshold is refit on all of the calibration data for the maximum number of effective epochs seen across the folds (no validation set is held out for the refit, so early stopping cannot be used there):

import numpy as np
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

def nn_cv(x_calib, y_calib, cv_folds, metric,
          nn_model, nn_model_copy, batchsize, num_epochs, stop_epochs, verbosity,
          models_dir, model_name, threshold, learn_rate_epochs):
      
    callbacks_list = [EarlyStopping(monitor='val_loss', patience=stop_epochs),
                      ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                                        patience=learn_rate_epochs,
                                        verbose=0, mode='auto', min_lr=1.0e-6)]
    
    cv_num = int(x_calib.shape[0]/cv_folds)
    list_cv_metrics = []
    list_cv_epochs = []
    for fold in range(cv_folds):
        # get train/valid
        x_train = np.vstack((x_calib[:cv_num*fold], x_calib[cv_num*(fold+1):]))        
        y_train = np.hstack((y_calib[:cv_num*fold], y_calib[cv_num*(fold+1):]))
         
        x_valid = x_calib[cv_num*fold:cv_num*(fold+1)]
        y_valid = y_calib[cv_num*fold:cv_num*(fold+1)]
        
        # fit the model
        h = nn_model.fit(x_train, y_train, epochs=num_epochs, 
                         batch_size=batchsize, validation_data=(x_valid, y_valid), 
                         callbacks=callbacks_list, verbose=verbosity)
     
        # collect cv stats
        list_cv_metrics.append(np.min(h.history['val_' + metric]))
        # effective best epoch: epochs actually run minus the early stopping patience
        epoch_best_early_stop = len(h.history['val_loss']) - stop_epochs
        list_cv_epochs.append(epoch_best_early_stop)
                 
    dict_return = {}
    mean_cv_error = np.mean(list_cv_metrics)
    max_cv_epochs = np.max(list_cv_epochs)
    print('mean_cv_error =',mean_cv_error,'\tthreshold =',threshold)
    print('max_cv_epochs =',max_cv_epochs)
    
    if mean_cv_error < threshold:
        # fit the model (use copy) with all calib data
        h_refit = nn_model_copy.fit(x_calib, y_calib, epochs=max_cv_epochs, 
                                    batch_size=batchsize, verbose=0)
        nn_model_copy.save(models_dir + model_name + '.h5')
        dict_return['Model Name'] = model_name
        dict_return['Metric'] = metric
        dict_return['Mean CV Error'] = mean_cv_error
        dict_return['Max CV Epochs'] = max_cv_epochs
        dict_return['Batch Size'] = batchsize
     
    return dict_return

In the main code below, note that we divided the lipophilicity values by 10, which maps their raw range of about [-1.5, 4.5] to [-0.15, 0.45]; with targets confined to this small range, the output layer can remain linear (no activation function).

import os
import time
from pathlib import Path

import numpy as np
import pandas as pd

import neural_network_ensemble  # the module holding create_ensemble (Section 5)

if __name__ == '__main__':
    
    calculation_type = 'production'  # 'production' or 'calibration'
    nn_type = 'mlp'
    target_divisor = 10.0  # target in [-0.15, 0.45], so no output activation needed
       
    base_directory = 'YOUR DIRECTORY'
    data_directory = base_directory + 'data/'
    
    results_directory_stub = base_directory + nn_type + '_fingerprints/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
    
    results_directory = results_directory_stub + 'results_' + nn_type + '_fingerprints/'
    if not Path(results_directory).is_dir():
        os.mkdir(results_directory)
    
    models_directory = results_directory_stub + 'models_' + nn_type + '_fingerprints/'
    if not Path(models_directory).is_dir():
        os.mkdir(models_directory)

    metric_type = 'mean_squared_error'
    type_of_loss = metric_type
    verbose = 0  #2
    epochs = 500 #200
    checkpoint_epochs_stop = 15
    learning_rate_epochs = 10
    
    cross_valid_folds = 3
    size_of_parameter_list = 1000
    error_threshold = 0.015
    max_time_minutes = 6*60
    number_of_models = 10
        
    if calculation_type == 'calibration':
        print('\n\n*** starting calibration at',pd.Timestamp.now())
        start_time_calibration = time.time()
            
        # get data
        ycalib = np.load(data_directory + 'y_calib.npy')/target_divisor   
        xcalib = np.load(data_directory + 'x_calib_fingerprints_1d.npy')
        length_of_input = xcalib[0].shape[0] # length_of_fingerprint_vector = 2048
       
        try:
            number_of_outputs = ycalib.shape[1]
        except IndexError:
            number_of_outputs = 1  # ycalib is 1d
        
        # set up parameters (stochastic)
        list_param_dict = parameters_mlp(size_of_parameter_list)
        list_dict_results = []
        counter = 0
        # loop over parameters
        for dictionary_parameter in list_param_dict:   
            
            modelname = nn_type + '_fingerprints_' + str(counter)
            counter = counter + 1
                
            print('\n\n*** starting at',pd.Timestamp.now())
            start_time_loop = time.time()            
            
            model_nn = mlp_model_construction(dictionary_parameter, 
                                                 number_of_outputs, length_of_input,
                                                 type_of_loss, metric_type) 
            print(dictionary_parameter)                     
            
            dict_results = nn_cv(xcalib, ycalib, 
                                    cross_valid_folds, metric_type,
                                    model_nn, model_nn, 
                                    dictionary_parameter['batch_size'], epochs, 
                                    checkpoint_epochs_stop, verbose,
                                    models_directory, modelname, error_threshold,
                                    learning_rate_epochs)
            
            elapsed_time_minutes = (time.time() - start_time_loop)/60
            print('*** elapsed time =',elapsed_time_minutes,' mins\tcounter =',counter-1)
            
            # add epochs and modelname to param dict and save it
            if len(dict_results) > 0:
                dictionary_parameter.update(dict_results)
                list_dict_results.append(dictionary_parameter)
                
            # check elapsed time for early stopping
            elapsed_time_minutes = (time.time() - start_time_calibration)/60
            if elapsed_time_minutes > max_time_minutes:
                break
        
        print('\n\n*** ending calibration at',pd.Timestamp.now())
        print('elapsed time =',(time.time()-start_time_calibration)/60,' min')
        
        # collect results in a df, then save
        df_param_results = pd.DataFrame(data=list_dict_results)
        df_param_results.to_pickle(models_directory + 'df_parameter_results.pkl')
        df_param_results.to_csv(models_directory + 'df_parameter_results.csv')
        list_dict_results.clear()
            
    elif calculation_type == 'production':
        
        # get data
        yprod = np.load(data_directory + 'y_prod.npy')  # UNALTERED (SEE CREATE_ENSEMBLE)
        xprod = np.load(data_directory + 'x_prod_fingerprints_1d.npy')
        length_of_input = xprod[0].shape[0] # length_of_fingerprint_vector = 2048
            
        df_results = pd.read_pickle(models_directory + 'df_parameter_results.pkl')
                      
        neural_network_ensemble.create_ensemble('regression', 
                       'Mean CV Error', 'Keras', 
                       df_results, number_of_models, models_directory, 
                       results_directory, xprod, yprod, target_divisor)
        
    else:
        raise NameError('invalid calculation_type: ' + calculation_type)

5. Production (Unseen Data) Results Code

We sort the cross validation results, then read in the 10 best models. The final result is the median of these models' predictions; the median is used rather than the mean because it is robust to an occasional poor prediction from a single model.

import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score

# *******************************************************************

# y_prod MUST BE THE ORIGINAL UNALTERED VALUES

# y_pred_multiplier = the divisor used in the calibration code
# (rescales predictions back to the original units)

def create_ensemble(problem_type, sort_column_name, nn_library_name, 
                    dfresults, num_models, models_dir, results_dir,
                    x_prod, y_prod, y_pred_multiplier=1.0):
    
    assert(dfresults.shape[0] > 0)
    num_models = min(num_models, dfresults.shape[0])
    
    if nn_library_name == 'Keras':
        from keras.models import load_model as load_model_nn
    else:
        print('\ninvalid nn library name:',nn_library_name)
        raise NameError
    
    if problem_type == 'regression':
        dfr = dfresults.sort_values(by=sort_column_name, ascending=True, inplace=False)
        dfr = dfr[:num_models]
                    
        df_production_errors = pd.DataFrame(index=['Mean Squared Error',
                                                   'Root Mean Squared Error',
                                                   'Median Absolute Error',
                                                   'Correlation Coefficient','R2'])
    
        list_predictions = []
        for i in range(num_models):
            model_name = dfr.loc[dfr.index[i],'Model Name']
            
            model_final = load_model_nn(models_dir + model_name + '.h5')
               
            y_prod_predict = model_final.predict(x_prod)*y_pred_multiplier
            
            list_predictions.append(y_prod_predict)
            
            y_prod_predict_temp = y_prod_predict.reshape(y_prod_predict.shape[0],)
            mse = mean_squared_error(y_prod, y_prod_predict)
            df_production_errors[model_name] = [mse,
                               np.sqrt(mse),  # RMSE, as reported in the results tables
                               median_absolute_error(y_prod, y_prod_predict),
                               np.corrcoef(y_prod, y_prod_predict_temp)[0][1],
                               r2_score(y_prod, y_prod_predict)]
                        
        # compute the ensemble prediction: per-sample median across the models
        array_predictions = np.hstack(list_predictions)  # shape (n_samples, num_models)
        median_prediction = np.median(array_predictions, axis=1)
        
        mse_ensemble = mean_squared_error(y_prod, median_prediction)
        df_production_errors['Ensemble Median'] = [mse_ensemble,
                               np.sqrt(mse_ensemble),
                               median_absolute_error(y_prod, median_prediction),
                               np.corrcoef(y_prod, median_prediction)[0][1],
                               r2_score(y_prod, median_prediction)]
            
        df_production_results = pd.DataFrame(data=y_prod, 
                                             columns=['Actual'])
        df_production_results['Ensemble Predicted'] = median_prediction
                      
    elif problem_type == 'classification':
        print('\nno coding yet for classification')
        raise NameError
    else:        
        raise NameError
        
    dfr.to_pickle(results_dir + 'df_parameters_ensemble.pkl')
    dfr.to_csv(results_dir + 'df_parameters_ensemble.csv')
    
    df_production_results.to_pickle(results_dir + 'df_production_results.pkl')
    df_production_results.to_csv(results_dir + 'df_production_results.csv')
    
    df_production_errors.to_pickle(results_dir + 'df_production_errors.pkl')
    df_production_errors.to_csv(results_dir + 'df_production_errors.csv')