Prediction of Small Molecule Lipophilicity: Part 1 - Data Exploration

1. Introduction

This is the first in a series of articles in which we use machine learning algorithms to predict lipophilicity values for small molecules. Lipophilicity is defined as ‘octanol/water distribution coefficient (logD at pH 7.4).’

Data is from MoleculeNet: A Benchmark for Molecular Machine Learning. The raw data can be found here (downloadable zip file). Below is a snippet of the raw data.

Lipophilicity Data

CMPD_CHEMBLID	exp	smiles
CHEMBL596271	3.54	Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14
CHEMBL1951080	-1.18	COc1cc(OC)c(cc1NC(=O)CSCC(=O)O)S(=O)(=O)N2C(C)CCc3ccccc23
CHEMBL1771	3.69	COC(=O)[C@@H](N1CCc2sccc2C1)c3ccccc3Cl

The column labeled ‘smiles’ contains SMILES (Simplified Molecular Input Line Entry System) ASCII character representation of molecules. One of the most important issues in applying machine learning to the analysis of small molecules is how to convert representations of molecular information, such as SMILES, into matrices of real numbers, the required input format for machine learning. This process is called featurization.

2. Featurization

Our first method of featurization will entail character embedding of SMILES strings. We use the same approach as in Peptide Binding – Part 1: 1D Convolutional Neural Network.

import numpy as np
import pandas as pd

def smiles_to_embedding():    
    calib_fraction = 0.8
    
    directory = YOUR DATA DIRECTORY
    dfl = pd.read_pickle(directory + 'df_lipophilicity_raw_smiles_shuffled.pkl')
    smiles_array = dfl['smiles'].values
        
    # create sorted, unique array of smiles characters
    array_smiles_unique = np.unique(list("".join(smiles_array)))
        
    # create embedded (not padded)
    mapping = dict(zip(array_smiles_unique, range(1, 1 + len(array_smiles_unique))))
    list_embedded_smiles = []
    for s in smiles_array:
        list_embedded_smiles.append([mapping[i] for i in s])
        
    df_lipophilicity_embedded_smiles_shuffled = pd.DataFrame(index=dfl.index,
                                                             data=dfl.values,
                                                             columns=dfl.columns)
    df_lipophilicity_embedded_smiles_shuffled['smiles embedded'] = list_embedded_smiles
    
    
    df_lipophilicity_embedded_smiles_shuffled.to_pickle(directory + 
                   'df_lipophilicity_embedded_smiles_shuffled.pkl')
    
    
    # split and save as numpy arrays
    calib_num = int(calib_fraction*df_lipophilicity_embedded_smiles_shuffled.shape[0])
    x_calib = df_lipophilicity_embedded_smiles_shuffled['smiles embedded'].values[:calib_num]
    np.save(directory + 'x_calib_smiles_embedded.npy', x_calib)
    x_prod = df_lipophilicity_embedded_smiles_shuffled['smiles embedded'].values[calib_num:]
    np.save(directory + 'x_prod_smiles_embedded.npy', x_prod)

Previously, we read in the raw data, converted it to a Dataframe, and shuffled it.

Now a SMLIES string becomes a 1D vector of integers, in which each ASCII character is assigned a unique number, starting at 1. So, ‘CN(C)C(=O)[C@H](Cc1ccccc1)NC(=O)c2cc3ccsc3[nH]2’ becomes, [22,26,3,22,4,22,3,…]. Later, in neural network codes, we will append 0s on the right so that all vectors have the same length.

Addendum

(We thought that we had made a mistake, then deleted the code and remarks above, and then realized that we did not, and are now restoring the information. Character level embedding of SMILES strings has some odd characteristics. We must remember that SMILES strings encode chemical information at the single and multiple character levels. When we encode such strings at such levels (multiple characters would be “words”), we do not do so based on chemically relevant information. Rather, we are simply attempting to use a CNN or RNN to match patterns of characters. For example, Si is treated as S and i when embedding at the single character level, despite i not having chemical relevance independent of Si. What we assume is that the neural network will learn to associate i with Si and distinguish S from Si in the context of the particular problem to be solved. Here, we assume that the network will learn such distinctions in the process of minimizing the mean squared error between its predictions of lipophilicity and actual values. As we will see in subsequent articles, graph convolutional networks represent the best approach of embedding chemically relevant information for neural networks.)

Morgan (circular) fingerprints are another method of representing molecular information. In the code below, we use RDKit to convert SMILES strings into fingerprints. These are bit vectors, which are subsequently converted into Numpy arrays. Each array is a 1D binary vector, in which 1s represent the existence of a local (radius of 2) molecular feature (see Morgan Fingerprints (Circular Fingerprints) and Feature Definitions Used in the Morgan Fingerprints).

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_morgan_fingerprint():    
    calib_fraction = 0.8
    
    directory = YOUR DATA DIRECTORY
    dfl = pd.read_pickle(directory + 'df_lipophilicity_raw_smiles_shuffled.pkl')
    smiles_array = dfl['smiles'].values
    
    # 1d    
    # generate molecues from smiles strings
    list_molecules = []
    for s in smiles_array:
        list_molecules.append(Chem.MolFromSmiles(s))
                
    # generate fingeprints: Morgan fingerprint with radius 2
    list_fingerprints = []
    for molecule in list_molecules:
        list_fingerprints.append(AllChem.GetMorganFingerprintAsBitVect(molecule, 2))

    # convert the RDKit explicit vectors into numpy arrays
    # each is a binary vector of length 2048
    x_fingerprints_1d = np.asarray(list_fingerprints)
    x_fingerprints_2d = np.reshape(x_fingerprints_1d, 
                                   (x_fingerprints_1d.shape[0], 32, 64))
        
    # split data then save as numpy arrays
    calib_num = int(calib_fraction*dfl.shape[0])
    
    x_calib_1d = x_fingerprints_1d[:calib_num]
    np.save(directory + 'x_calib_fingerprints_1d.npy', x_calib_1d)
    
    x_calib_2d = x_fingerprints_2d[:calib_num]
    np.save(directory + 'x_calib_fingerprints_2d.npy', x_calib_2d)
    
    y_calib = dfl['exp'].values[:calib_num]
    np.save(directory + 'y_calib.npy', y_calib)
        
    x_prod_1d = x_fingerprints_1d[calib_num:]
    np.save(directory + 'x_prod_fingerprints_1d.npy', x_prod_1d)
    
    x_prod_2d = x_fingerprints_2d[calib_num:]
    np.save(directory + 'x_prod_fingerprints_2d.npy', x_prod_2d)
    
    y_prod = dfl['exp'].values[calib_num:]
    np.save(directory + 'y_prod.npy', y_prod)

The 1D arrays have 2048 elements, the 2D arrays have shape (32, 64). 1D arrays will be used for non-neural network algorithms, 1D CNN and RNN. While the 2D arrays will be used for a 2D CNN.

There are other methods of converting molecular data to features that are specific to the type of machine learning algorithm used, such as graph convolution, message passing, etc. These will be covered in future articles.

3. Target

We generate statistics for the target variable.

import numpy as np
import pandas as pd
   
def target_stats():
    directory = YOUR DATA DIRECTORY
    dfl = pd.read_pickle(directory + 'df_lipophilicity_raw_smiles_shuffled.pkl')
    
    data = dfl['exp'].values
    percentile_start=0.1
    percentile_step=0.1    
    percentiles = np.arange(percentile_start, 1, percentile_step)
    header = []
    for p in percentiles:
        name = str(int(100.01*p))+'th Percentile'
        header.append(name)
        
    header.append('Minimum')
    header.append('Maximum')
    header.append('Mean')
    header.append('Standard Deviation')
    header.append('Variance')
    header.append('Standard Error')
    header.append('# of Points')
    
    list_results = list(np.quantile(data, percentiles))
    list_results.append(np.min(data))
    list_results.append(np.max(data))
    list_results.append(np.mean(data))
    list_results.append(np.std(data))
    list_results.append(np.var(data))
    list_results.append(np.std(data)/np.sqrt(len(data)))
    list_results.append(len(data))
    
    df_results = (pd.DataFrame(data = list_results, index = header)).T
    df_results.to_csv('target_statistics.csv', index=False)

10th Percentile	20th Percentile	30th Percentile	40th Percentile	50th Percentile	60th Percentile	70th Percentile	80th Percentile	90th Percentile	Minimum	Maximum	Mean	Standard Deviation	Variance	Standard Error	# of Points
0.52	1.178	1.64	2.0	2.36	2.67	2.95	3.25	3.6	-1.5	4.5	2.186	1.202	1.446	0.0185	4200

4. Visualizing Molecules

RDKit has functions to convert SMILES strings to images.

from rdkit.Chem import Draw
Draw.MolToFile(Chem.MolFromSmiles(‘CN(C)C(=O)[C@H](Cc1ccccc1)NC(=O)c2cc3ccsc3[nH]2’), 'molecule.png')

As a teaser for a future article, think about CNNs.

Prediction of Small Molecule Lipophilicity: Part 1 – Data Exploration