This is the first in a series of articles in which we use machine learning algorithms to predict lipophilicity values for small molecules. Lipophilicity is defined as ‘octanol/water distribution coefficient (logD at pH 7.4).’
Data is from MoleculeNet: A Benchmark for Molecular Machine Learning. The raw data can be found here (downloadable zip file). Below is a snippet of the raw data.
The column labeled ‘smiles’ contains SMILES (Simplified Molecular Input Line Entry System) ASCII character representation of molecules. One of the most important issues in applying machine learning to the analysis of small molecules is how to convert representations of molecular information, such as SMILES, into matrices of real numbers, the required input format for machine learning. This process is called featurization.
Our first method of featurization will entail character embedding of SMILES strings. We use the same approach as in Peptide Binding – Part 1: 1D Convolutional Neural Network.
import numpy as np import pandas as pd def smiles_to_embedding(): calib_fraction = 0.8 directory = YOUR DATA DIRECTORY dfl = pd.read_pickle(directory + 'df_lipophilicity_raw_smiles_shuffled.pkl') smiles_array = dfl['smiles'].values # create sorted, unique array of smiles characters array_smiles_unique = np.unique(list("".join(smiles_array))) # create embedded (not padded) mapping = dict(zip(array_smiles_unique, range(1, 1 + len(array_smiles_unique)))) list_embedded_smiles =  for s in smiles_array: list_embedded_smiles.append([mapping[i] for i in s]) df_lipophilicity_embedded_smiles_shuffled = pd.DataFrame(index=dfl.index, data=dfl.values, columns=dfl.columns) df_lipophilicity_embedded_smiles_shuffled['smiles embedded'] = list_embedded_smiles df_lipophilicity_embedded_smiles_shuffled.to_pickle(directory + 'df_lipophilicity_embedded_smiles_shuffled.pkl') # split and save as numpy arrays calib_num = int(calib_fraction*df_lipophilicity_embedded_smiles_shuffled.shape) x_calib = df_lipophilicity_embedded_smiles_shuffled['smiles embedded'].values[:calib_num] np.save(directory + 'x_calib_smiles_embedded.npy', x_calib) x_prod = df_lipophilicity_embedded_smiles_shuffled['smiles embedded'].values[calib_num:] np.save(directory + 'x_prod_smiles_embedded.npy', x_prod)
Previously, we read in the raw data, converted it to a Dataframe, and shuffled it.
Now a SMLIES string becomes a 1D vector of integers, in which each ASCII character is assigned a unique number, starting at 1. So, ‘CN(C)C(=O)[C@H](Cc1ccccc1)NC(=O)c2cc3ccsc3[nH]2’ becomes, [22,26,3,22,4,22,3,…]. Later, in neural network codes, we will append 0s on the right so that all vectors have the same length.
(We thought that we had made a mistake, then deleted the code and remarks above, and then realized that we did not, and are now restoring the information. Character level embedding of SMILES strings has some odd characteristics. We must remember that SMILES strings encode chemical information at the single and multiple character levels. When we encode such strings at such levels (multiple characters would be “words”), we do not do so based on chemically relevant information. Rather, we are simply attempting to use a CNN or RNN to match patterns of characters. For example, Si is treated as S and i when embedding at the single character level, despite i not having chemical relevance independent of Si. What we assume is that the neural network will learn to associate i with Si and distinguish S from Si in the context of the particular problem to be solved. Here, we assume that the network will learn such distinctions in the process of minimizing the mean squared error between its predictions of lipophilicity and actual values. As we will see in subsequent articles, graph convolutional networks represent the best approach of embedding chemically relevant information for neural networks.)
Morgan (circular) fingerprints are another method of representing molecular information. In the code below, we use RDKit to convert SMILES strings into fingerprints. These are bit vectors, which are subsequently converted into Numpy arrays. Each array is a 1D binary vector, in which 1s represent the existence of a local (radius of 2) molecular feature (see Morgan Fingerprints (Circular Fingerprints) and Feature Definitions Used in the Morgan Fingerprints).
import numpy as np import pandas as pd from rdkit import Chem from rdkit.Chem import AllChem def smiles_to_morgan_fingerprint(): calib_fraction = 0.8 directory = YOUR DATA DIRECTORY dfl = pd.read_pickle(directory + 'df_lipophilicity_raw_smiles_shuffled.pkl') smiles_array = dfl['smiles'].values # 1d # generate molecues from smiles strings list_molecules =  for s in smiles_array: list_molecules.append(Chem.MolFromSmiles(s)) # generate fingeprints: Morgan fingerprint with radius 2 list_fingerprints =  for molecule in list_molecules: list_fingerprints.append(AllChem.GetMorganFingerprintAsBitVect(molecule, 2)) # convert the RDKit explicit vectors into numpy arrays # each is a binary vector of length 2048 x_fingerprints_1d = np.asarray(list_fingerprints) x_fingerprints_2d = np.reshape(x_fingerprints_1d, (x_fingerprints_1d.shape, 32, 64)) # split data then save as numpy arrays calib_num = int(calib_fraction*dfl.shape) x_calib_1d = x_fingerprints_1d[:calib_num] np.save(directory + 'x_calib_fingerprints_1d.npy', x_calib_1d) x_calib_2d = x_fingerprints_2d[:calib_num] np.save(directory + 'x_calib_fingerprints_2d.npy', x_calib_2d) y_calib = dfl['exp'].values[:calib_num] np.save(directory + 'y_calib.npy', y_calib) x_prod_1d = x_fingerprints_1d[calib_num:] np.save(directory + 'x_prod_fingerprints_1d.npy', x_prod_1d) x_prod_2d = x_fingerprints_2d[calib_num:] np.save(directory + 'x_prod_fingerprints_2d.npy', x_prod_2d) y_prod = dfl['exp'].values[calib_num:] np.save(directory + 'y_prod.npy', y_prod)
The 1D arrays have 2048 elements, the 2D arrays have shape (32, 64). 1D arrays will be used for non-neural network algorithms, 1D CNN and RNN. While the 2D arrays will be used for a 2D CNN.
There are other methods of converting molecular data to features that are specific to the type of machine learning algorithm used, such as graph convolution, message passing, etc. These will be covered in future articles.
We generate statistics for the target variable.
import numpy as np import pandas as pd def target_stats(): directory = YOUR DATA DIRECTORY dfl = pd.read_pickle(directory + 'df_lipophilicity_raw_smiles_shuffled.pkl') data = dfl['exp'].values percentile_start=0.1 percentile_step=0.1 percentiles = np.arange(percentile_start, 1, percentile_step) header =  for p in percentiles: name = str(int(100.01*p))+'th Percentile' header.append(name) header.append('Minimum') header.append('Maximum') header.append('Mean') header.append('Standard Deviation') header.append('Variance') header.append('Standard Error') header.append('# of Points') list_results = list(np.quantile(data, percentiles)) list_results.append(np.min(data)) list_results.append(np.max(data)) list_results.append(np.mean(data)) list_results.append(np.std(data)) list_results.append(np.var(data)) list_results.append(np.std(data)/np.sqrt(len(data))) list_results.append(len(data)) df_results = (pd.DataFrame(data = list_results, index = header)).T df_results.to_csv('target_statistics.csv', index=False)
|10th Percentile||20th Percentile||30th Percentile||40th Percentile||50th Percentile||60th Percentile||70th Percentile||80th Percentile||90th Percentile||Minimum||Maximum||Mean||Standard Deviation||Variance||Standard Error||# of Points|
4. Visualizing Molecules
RDKit has functions to convert SMILES strings to images.
from rdkit.Chem import Draw Draw.MolToFile(Chem.MolFromSmiles(‘CN(C)C(=O)[C@H](Cc1ccccc1)NC(=O)c2cc3ccsc3[nH]2’), 'molecule.png')
As a teaser for a future article, think about CNNs.