Previous articles in this series:
- Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration
- Prediction of Small Molecule Lipophilicity: Part 2 – TPOT (Tree-based Pipeline Optimization Tool) With Morgan (Circular) Fingerprints
- Prediction of Small Molecule Lipophilicity: Part 3 – Ensemble of Multilayer Perceptrons With Morgan (Circular) Fingerprints
- Prediction of Small Molecule Lipophilicity: Part 4 – Ensemble of 1D Convolutional Neural Networks With Morgan (Circular) Fingerprints
1. Introduction
Using Morgan (circular) fingerprints as input data, we create an ensemble of 2D convolutional neural networks (CNNs) with Keras to predict lipophilicity values. See Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration for the link to the raw data as well as the code for converting SMILES strings to fingerprints via RDKit.
2. Data Shaping
Morgan fingerprints are 1D Numpy arrays of length 2048 consisting of 1s and 0s, which we reshape to (height, width) = (32, 64); see Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration. In the CNN we use the channels-first option with a single channel. First, we read in a Numpy array with all of the input data and then reshape it to (number of data points, 1, height, width), where 1 = number of channels (see the main code at the end of this article).
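For concreteness, a minimal sketch of this reshaping step (assuming the file names used in the main code below) is:

```python
import numpy as np

# each entry of the calibration array is a (32, 64) grid of fingerprint bits
xcalib = np.load('x_calib_fingerprints_2d.npy')     # shape: (num_pts, 32, 64)

num_pts, height, width = xcalib.shape               # height*width = 2048
xcalib = xcalib.reshape(num_pts, 1, height, width)  # (num_pts, channels=1, height, width)
```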
Next, we must ensure that the correct data shape is specified. We have
input_tensor = Input(shape=(1, height_size, width_size))
In the 2D convolutional and max pooling layers we specify data_format='channels_first' (see the model construction code at the end of this article).
Additionally, we must be careful with the sizes of the convolutional kernels and pooling. We use a fixed kernel_size of (3, 3), pool_size of (2, 2), and stride of 2. The number of filters is set to an initial value (a hyperparameter) for the first convolutional block; in each subsequent block it doubles, capped at a maximum value (also a hyperparameter), as in the short sketch below (see the model construction code at the end of this article).
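For illustration only, a single convolution/pooling block with these settings, together with the capped filter doubling between blocks, looks roughly like this (the full construction function appears in Section 5):

```python
from keras.layers import Conv2D, MaxPooling2D

# one convolutional block in channels-first format with the fixed kernel and pool sizes
x = Conv2D(filters=num_filters, kernel_size=(3, 3), activation='relu',
           padding='valid', data_format='channels_first')(x)
x = MaxPooling2D(pool_size=(2, 2), strides=2, data_format='channels_first')(x)

# double the filter count for the next block, capped at the filters_maximum hyperparameter
num_filters = min(2*num_filters, filters_maximum)
```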
3. Neural Network Parameters
We use randomized parameter search for the hyperparameters of the 2D CNNs.
```python
import numpy as np
from sklearn.model_selection import ParameterSampler

def parameters_cnn(num_samples):
    batch_size = [32, 64]
    dropout = np.linspace(0.0, 0.5, num=11, endpoint=True)
    activation = ['relu', 'elu']
    cnn_filters_initial = [16, 32]
    cnn_filters_maximum = [32, 64, 128]
    cnn_blocks = [2, 3]
    layer_1_nodes_num = [32, 64, 128, 256]

    grid = {'batch_size': batch_size, 'dropout': dropout,
            'filters_initial': cnn_filters_initial,
            'filters_maximum': cnn_filters_maximum,
            'blocks': cnn_blocks, 'activation': activation,
            'layer_1_nodes_num': layer_1_nodes_num}

    parameter_grid = list(ParameterSampler(grid, n_iter=num_samples))

    return parameter_grid
```
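For example, we can draw 1000 candidate hyperparameter dictionaries (the size_of_parameter_list value used in the main code):

```python
list_param_dict = parameters_cnn(1000)

# each element is a dictionary of sampled hyperparameters, e.g. (illustrative values only):
# {'layer_1_nodes_num': 64, 'filters_maximum': 32, 'filters_initial': 16,
#  'dropout': 0.05, 'blocks': 2, 'batch_size': 32, 'activation': 'relu'}
print(list_param_dict[0])
```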
To create an ensemble of neural networks, we set a cross validation error threshold and keep the 10 models whose mean cross validation errors fall below it (a minimal selection sketch follows the table). These are the parameters of those models.
2D CNN Hyperparameters
layer_1_nodes_num | filters_maximum | filters_initial | dropout | blocks | batch_size | activation | Model Name | Metric | Mean CV Error | Max CV Epochs |
---|---|---|---|---|---|---|---|---|---|---|
128 | 128 | 32 | 0 | 2 | 32 | relu | cnn_2d_fingerprints_104 | mean_squared_error | 0.002777 | 19 |
64 | 64 | 32 | 0.05 | 2 | 32 | relu | cnn_2d_fingerprints_3 | mean_squared_error | 0.002942 | 37 |
128 | 32 | 32 | 0.05 | 2 | 64 | relu | cnn_2d_fingerprints_82 | mean_squared_error | 0.003037 | 27 |
64 | 32 | 16 | 0 | 2 | 64 | relu | cnn_2d_fingerprints_103 | mean_squared_error | 0.00311 | 20 |
128 | 32 | 32 | 0 | 2 | 64 | relu | cnn_2d_fingerprints_197 | mean_squared_error | 0.003126 | 16 |
32 | 64 | 32 | 0.05 | 2 | 32 | relu | cnn_2d_fingerprints_189 | mean_squared_error | 0.00315 | 33 |
256 | 64 | 16 | 0.05 | 2 | 64 | relu | cnn_2d_fingerprints_122 | mean_squared_error | 0.003217 | 26 |
256 | 128 | 32 | 0 | 3 | 32 | elu | cnn_2d_fingerprints_173 | mean_squared_error | 0.003224 | 9 |
64 | 128 | 32 | 0 | 3 | 32 | relu | cnn_2d_fingerprints_151 | mean_squared_error | 0.003237 | 10 |
128 | 64 | 32 | 0 | 3 | 64 | relu | cnn_2d_fingerprints_71 | mean_squared_error | 0.003279 | 14 |
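As a minimal sketch (not the actual implementation, which lives in the cross validation function and in create_ensemble from Part 3), the selection step amounts to sorting the saved results by mean cross validation error and keeping the 10 best models:

```python
import pandas as pd

# one row per model that passed the error threshold during calibration
df_results = pd.read_pickle('df_parameter_results.pkl')

# hypothetical selection: sort by cross validation error and keep the 10 best models
df_best = df_results.sort_values(by='Mean CV Error').head(10)
print(df_best[['Model Name', 'Mean CV Error', 'Max CV Epochs']])
```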
4. Results
Results were produced using unseen data.
Lipophilicity Errors
Model | Mean Squared Error | Root Mean Squared Error | Median Absolute Error | Correlation Coefficient | R2 |
---|---|---|---|---|---|
cnn_2d_fingerprints_104 | 0.794 | 0.8911 | 0.503 | 0.6974 | 0.4732 |
cnn_2d_fingerprints_3 | 0.7759 | 0.8808 | 0.5386 | 0.7041 | 0.4852 |
cnn_2d_fingerprints_82 | 0.7954 | 0.8919 | 0.536 | 0.7037 | 0.4722 |
cnn_2d_fingerprints_103 | 0.7774 | 0.8817 | 0.5154 | 0.704 | 0.4842 |
cnn_2d_fingerprints_197 | 0.8026 | 0.8959 | 0.4865 | 0.6946 | 0.4674 |
cnn_2d_fingerprints_189 | 0.7679 | 0.8763 | 0.5539 | 0.7088 | 0.4905 |
cnn_2d_fingerprints_122 | 0.816 | 0.9033 | 0.5338 | 0.6948 | 0.4586 |
cnn_2d_fingerprints_173 | 0.9104 | 0.9541 | 0.6087 | 0.6602 | 0.3959 |
cnn_2d_fingerprints_151 | 0.9268 | 0.9627 | 0.5469 | 0.6435 | 0.3851 |
cnn_2d_fingerprints_71 | 0.9789 | 0.9894 | 0.6028 | 0.6218 | 0.3505 |
Ensemble Median | 0.7115 | 0.8435 | 0.4828 | 0.7305 | 0.5279 |

Below is a comparison of our results to those of the MoleculeNet benchmark from MoleculeNet: A Benchmark for Molecular Machine Learning (page 51).
Our Results Versus Benchmark Results
Model | Root Mean Squared Error |
---|---|
Random Forest (RF) | 0.876 +/- 0.040 |
Multitask (Multilayer Perceptron with multiple outputs) | 0.859 +/- 0.013 |
XGBoost | 0.799 +/- 0.054 |
Kernel Ridge Regression (KRR) | 0.899 +/- 0.043 |
Graph Convolutional Network (GC) | 0.655 +/- 0.036 |
Directed Acyclic Graph (DAG) | 0.835 +/- 0.039 |
Weave | 0.715 +/- 0.035 |
Message Passing Neural Network (MPNN) | 0.719 +/- 0.031 |
TPOT Best | 0.767 |
TPOT Ensemble | 0.787 |
Multilayer Perceptron Ensemble | 0.823 |
1D CNN Fingerprints | 0.857 |
2D CNN Fingerprints | 0.844 |
R2 results in bar chart form can be seen at the MoleculeNet site (see section Physical Chemistry -> lipo).
The benchmark results were obtained by running multiple random permutations of the dataset, which we did not do, so the comparisons with our results are not exact.
5. Code
Model construction function:
```python
from keras.layers import Dense, Input, Dropout
from keras.layers import Conv2D, MaxPooling2D, Flatten
from keras.models import Model
from keras.optimizers import Adam

def cnn_2d_model_construction(dict_params_cnn, num_output_nodes,
                              loss_type, metric, height_size, width_size):
    format_of_data = 'channels_first'
    conv_kernel_size = (3, 3)
    padding_type = 'valid'
    poolsize = (2, 2)
    strides_value = 2

    input_tensor = Input(shape=(1, height_size, width_size))

    # 1st block
    x = Conv2D(filters=dict_params_cnn['filters_initial'],
               kernel_size=conv_kernel_size,
               activation=dict_params_cnn['activation'],
               padding=padding_type,
               data_format=format_of_data)(input_tensor)
    x = Dropout(dict_params_cnn['dropout'])(x)
    x = MaxPooling2D(pool_size=poolsize, strides=strides_value,
                     data_format=format_of_data)(x)

    num_filters = 2*dict_params_cnn['filters_initial']
    for iblock in range(dict_params_cnn['blocks'] - 1):
        x = Conv2D(filters=num_filters,
                   kernel_size=conv_kernel_size,
                   activation=dict_params_cnn['activation'],
                   padding=padding_type,
                   data_format=format_of_data)(x)
        x = Dropout(dict_params_cnn['dropout'])(x)
        x = MaxPooling2D(pool_size=poolsize, strides=strides_value,
                         data_format=format_of_data)(x)
        num_filters = min(2*num_filters, dict_params_cnn['filters_maximum'])

    x = Flatten()(x)
    x = Dropout(dict_params_cnn['dropout'])(x)
    x = Dense(dict_params_cnn['layer_1_nodes_num'],
              activation=dict_params_cnn['activation'])(x)
    x = Dropout(dict_params_cnn['dropout'])(x)

    # target E [-0.15, 0.45] so no need of activation function
    output_tensor = Dense(num_output_nodes)(x)

    model_cnn_f = Model(input_tensor, output_tensor)

    # compile the model
    opt = Adam(lr=0.00025)  # default = 0.001
    model_cnn_f.compile(optimizer=opt, loss=loss_type, metrics=[metric])

    return model_cnn_f
```
Cross validation, thresholding, and saving good models function:
```python
import numpy as np
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

def nn_cv(x_calib, y_calib, cv_folds, metric,
          nn_model, nn_model_copy, batchsize, num_epochs, stop_epochs,
          verbosity, models_dir, model_name, threshold, learn_rate_epochs):

    callbacks_list = [EarlyStopping(monitor='val_loss', patience=stop_epochs),
                      ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                                        patience=learn_rate_epochs,
                                        verbose=0, mode='auto', min_lr=1.0e-6)]

    cv_num = int(x_calib.shape[0]/cv_folds)
    list_cv_metrics = []
    list_cv_epochs = []

    for fold in range(cv_folds):
        # get train/valid
        x_train = np.vstack((x_calib[:cv_num*fold], x_calib[cv_num*(fold+1):]))
        y_train = np.hstack((y_calib[:cv_num*fold], y_calib[cv_num*(fold+1):]))

        x_valid = x_calib[cv_num*fold:cv_num*(fold+1)]
        y_valid = y_calib[cv_num*fold:cv_num*(fold+1)]

        # fit the model
        h = nn_model.fit(x_train, y_train, epochs=num_epochs,
                         batch_size=batchsize,
                         validation_data=(x_valid, y_valid),
                         callbacks=callbacks_list, verbose=verbosity)

        # collect cv stats
        list_cv_metrics.append(np.min(h.history['val_' + metric]))
        epoch_best_early_stop = len(h.history['val_loss']) - stop_epochs
        list_cv_epochs.append(epoch_best_early_stop)

    dict_return = {}
    mean_cv_error = np.mean(list_cv_metrics)
    max_cv_epochs = np.max(list_cv_epochs)
    print('mean_cv_error =', mean_cv_error, '\tthreshold =', threshold)
    print('max_cv_epochs =', max_cv_epochs)

    if mean_cv_error < threshold:
        # fit the model (use copy) with all calib data
        h_refit = nn_model_copy.fit(x_calib, y_calib, epochs=max_cv_epochs,
                                    batch_size=batchsize, verbose=0)
        nn_model_copy.save(models_dir + model_name + '.h5')

        dict_return['Model Name'] = model_name
        dict_return['Metric'] = metric
        dict_return['Mean CV Error'] = mean_cv_error
        dict_return['Max CV Epochs'] = max_cv_epochs

    return dict_return
```
In the main code below, note that we divide the lipophilicity values by 10 so that they lie in the range [-0.15, 0.45]; thus no activation function is used on the output layer.
```python
import pandas as pd
import numpy as np
import time
import os
from pathlib import Path
import sys
sys.path.append(YOUR DIRECTORY)
import neural_network_ensemble

if __name__ == '__main__':

    calculation_type = 'calibration'  # production calibration
    nn_type = 'cnn_2d'
    target_divisor = 10.0  # target E [-0.15, 0.45] so no need of activation function

    base_directory = YOUR DIRECTORY
    data_directory = base_directory + 'data/'

    results_directory_stub = base_directory + nn_type + '_fingerprints/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)

    results_directory = results_directory_stub + 'results_' + nn_type + '_fingerprints/'
    if not Path(results_directory).is_dir():
        os.mkdir(results_directory)

    models_directory = results_directory_stub + 'models_' + nn_type + '_fingerprints/'
    if not Path(models_directory).is_dir():
        os.mkdir(models_directory)

    # limits and params
    metric_type = 'mean_squared_error'
    type_of_loss = metric_type
    verbose = 0
    epochs = 200
    checkpoint_epochs_stop = 15
    learning_rate_epochs = 10
    cross_valid_folds = 3
    size_of_parameter_list = 1000
    error_threshold = 0.008
    max_time_minutes = 6*60
    number_of_models = 10

    if calculation_type == 'calibration':
        print('\n\n*** starting calibration at', pd.Timestamp.now())
        start_time_calibration = time.time()

        # get data
        ycalib = np.load(data_directory + 'y_calib.npy')/target_divisor
        xcalib = np.load(data_directory + 'x_calib_fingerprints_2d.npy')

        try:
            number_of_outputs = ycalib.shape[1]
        except:
            number_of_outputs = 1  # ycalib.shape[1]

        # reshape data
        num_pts = xcalib.shape[0]
        height = xcalib[0].shape[0]
        width = xcalib[0].shape[1]  # height*width = length_of_fingerprint_vector = 2048
        xcalib = xcalib.reshape(num_pts, 1, height, width)

        # set up parameters (stochastic)
        list_param_dict = parameters_cnn(size_of_parameter_list)
        list_dict_results = []
        counter = 0

        # loop over parameters
        for dictionary_parameter in list_param_dict:

            modelname = nn_type + '_fingerprints_' + str(counter)
            counter = counter + 1

            print('\n\n*** starting at', pd.Timestamp.now())
            start_time_loop = time.time()

            model_nn = cnn_2d_model_construction(dictionary_parameter,
                                                 number_of_outputs, type_of_loss,
                                                 metric_type, height, width)
            print(dictionary_parameter)

            dict_results = nn_cv(xcalib, ycalib, cross_valid_folds, metric_type,
                                 model_nn, model_nn,
                                 dictionary_parameter['batch_size'], epochs,
                                 checkpoint_epochs_stop, verbose,
                                 models_directory, modelname, error_threshold,
                                 learning_rate_epochs)

            elapsed_time_minutes = (time.time() - start_time_loop)/60
            print('*** elapsed time =', elapsed_time_minutes, ' mins\tcounter =', counter-1)

            # add epochs and modelname to param dict and save it
            if len(dict_results) > 0:
                dictionary_parameter.update(dict_results)
                list_dict_results.append(dictionary_parameter)

            # check elapsed time for early stopping
            elapsed_time_minutes = (time.time() - start_time_calibration)/60
            if elapsed_time_minutes > max_time_minutes:
                break

        print('\n\n*** ending calibration at', pd.Timestamp.now())
        print('elapsed time =', (time.time()-start_time_calibration)/60, ' min')

        # collect results in a df, then save
        df_param_results = pd.DataFrame(data=list_dict_results)
        df_param_results.to_pickle(models_directory + 'df_parameter_results.pkl')
        df_param_results.to_csv(models_directory + 'df_parameter_results.csv')

        list_dict_results.clear()

    elif calculation_type == 'production':

        # get data
        yprod = np.load(data_directory + 'y_prod.npy')  # UNALTERED (SEE CREATE_ENSEMBLE)
        xprod = np.load(data_directory + 'x_prod_fingerprints_2d.npy')

        # reshape
        num_pts = xprod.shape[0]
        height = xprod[0].shape[0]
        width = xprod[0].shape[1]  # height*width = length_of_fingerprint_vector = 2048
        xprod = xprod.reshape(num_pts, 1, height, width)

        df_results = pd.read_pickle(models_directory + 'df_parameter_results.pkl')

        neural_network_ensemble.create_ensemble('regression', 'Mean CV Error',
                                                'Keras', df_results,
                                                number_of_models, models_directory,
                                                results_directory,
                                                xprod, yprod, target_divisor)

    else:
        raise NameError
```
For the production (unseen data) code, we sort the cross validation results and then read in the 10 best models. The final result is the median lipophilicity prediction of these models. See Prediction of Small Molecule Lipophilicity: Part 3 – Ensemble of Multilayer Perceptrons With Morgan (Circular) Fingerprints for the ensemble code.
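A minimal sketch of that production step, assuming the best model names have already been selected as in the earlier sketch (the actual implementation is create_ensemble from Part 3):

```python
import numpy as np
from keras.models import load_model

predictions = []
for model_name in df_best['Model Name']:
    model = load_model(models_directory + model_name + '.h5')
    # undo the /10 target scaling applied during calibration
    predictions.append(target_divisor*model.predict(xprod).reshape(-1))

# final ensemble prediction: elementwise median over the 10 models
y_pred_ensemble = np.median(np.array(predictions), axis=0)
```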