SMILES (Simplified Molecular Input Line Entry System) representations of molecules have found many uses in machine learning algorithms, especially those derived from natural language processing techniques. However, they were not designed for machine learning and thus suffer from various syntax issues that can hamper machine learning methods, especially generative methods. DeepSMILES is a modification of SMILES explicitly designed to address these issues.
The approach we propose is to transform each SMILES string into another syntax, referred to as DeepSMILES, that conveys identical information but is better suited to machine learning. The training set is first converted to DeepSMILES, and these strings (not the original SMILES) are used as input to the machine-learning method. The process is then reversed to interpret the output of a generative model: the generated DeepSMILES strings are converted back to SMILES strings for subsequent processing or display.
DeepSMILES addresses two of these syntax problems, unmatched ring closure digits and unbalanced parentheses, as follows:
- DeepSMILES uses a single ring closure symbol, placed at the ring-closing position and giving the ring size, instead of a pair of matching digits. Unmatched ring closure digits are therefore not possible.
- DeepSMILES avoids paired parentheses by adopting a postfix notation that uses only close parentheses, where the number of close parentheses indicates the branch length.
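The two rules above can be illustrated with a minimal Python sketch. It is not the DeepSMILES reference implementation: the function names `ring_closures_to_sizes` and `branches_to_postfix` are hypothetical, and the sketch assumes a small SMILES subset (unbracketed atoms, single-digit ring closures without bond symbols, rings of at most nine atoms). For simplicity, each function handles only one of the two rules at a time.

```python
import re

# Tokens of a deliberately small SMILES subset (an assumption of this sketch):
# unbracketed atoms, double/triple bond symbols, parentheses, single digits.
TOKEN = re.compile(r"Cl|Br|[A-Za-z]|[=#()]|\d")


def ring_closures_to_sizes(smiles):
    """Replace each pair of ring-closure digits with one ring-size digit.

    Assumes an unbranched string, so the ring size is the number of atoms
    from the ring-opening atom to the ring-closing atom, inclusive.
    """
    if "(" in smiles or ")" in smiles:
        raise ValueError("this sketch only handles unbranched SMILES")
    out = []
    opened = {}      # ring-closure digit -> index of the atom that opened it
    atom_index = -1  # index of the most recently read atom
    for tok in TOKEN.findall(smiles):
        if tok.isdigit():
            if tok in opened:
                size = atom_index - opened.pop(tok) + 1
                if size > 9:
                    raise ValueError("rings larger than 9 atoms are not handled")
                out.append(str(size))
            else:
                opened[tok] = atom_index  # the opening digit is dropped
        else:
            if tok[0].isalpha():
                atom_index += 1
            out.append(tok)
    if opened:
        raise ValueError("unmatched ring-closure digit in input SMILES")
    return "".join(out)


def branches_to_postfix(smiles):
    """Drop open parentheses and close each branch with one ')' per atom
    on that branch's own chain. Assumes a ring-free string."""
    if any(ch.isdigit() for ch in smiles):
        raise ValueError("this sketch only handles ring-free SMILES")
    out = []
    chain_lengths = []  # one atom counter per currently open branch
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            chain_lengths.append(0)  # the open parenthesis is dropped
        elif tok == ")":
            out.append(")" * chain_lengths.pop())
        else:
            if tok[0].isalpha() and chain_lengths:
                chain_lengths[-1] += 1
            out.append(tok)
    return "".join(out)
```

Under these assumptions, `ring_closures_to_sizes("c1ccccc1")` and `ring_closures_to_sizes("c2ccccc2")` both give `cccccc6`, and `branches_to_postfix("CC(=O)Cl")` gives `CC=O)Cl`.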
DeepSMILES also addresses many of the cases in which a single molecule has several different SMILES representations that machine-learning algorithms may not recognize as equivalent.
For example, while a phenyl ring in a larger molecule may occur in SMILES notation as c2ccccc2, c3ccccc3, etc., it will always occur with the same notation in DeepSMILES, namely cccccc6.
SMILES can be easily converted to DeepSMILES with a function call via the DeepSMILES Python package.
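A round trip through the package might look like the sketch below. It assumes the third-party `deepsmiles` package is installed (for example via `pip install deepsmiles`) and that it exposes a `Converter` class with `rings` and `branches` options, `encode` and `decode` methods, and a `DecodeError` exception, as described in the package's README. The helper name `roundtrip` is hypothetical.

```python
def roundtrip(smiles):
    """Encode a SMILES string to DeepSMILES and decode it back again."""
    # Imported lazily so that this sketch can be loaded even when the
    # third-party package is not installed.
    import deepsmiles

    # Enable both the ring-closure and the branch transformations.
    converter = deepsmiles.Converter(rings=True, branches=True)
    encoded = converter.encode(smiles)
    try:
        decoded = converter.decode(encoded)
    except deepsmiles.DecodeError:
        # Raised when a DeepSMILES string cannot be converted back to SMILES.
        decoded = None
    return encoded, decoded
```

In a generative workflow, the `encode` step would be applied to the training set and the `decode` step to each generated string, with a `None` result marking output that could not be converted back to SMILES.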
The abstract of the paper "DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures" is reproduced below.
There has been increasing interest in the use of deep neural networks for de novo design of molecules with desired properties. A common approach is to train a generative model on SMILES strings and then use this to generate SMILES strings for molecules with a desired property. Unfortunately, these SMILES strings are often not syntactically valid due to elements of SMILES syntax that must occur in pairs.
We describe a SMILES-like syntax called DeepSMILES that addresses two of the main reasons for invalid syntax when using a probabilistic model to generate SMILES strings. The DeepSMILES syntax avoids the problem of unbalanced parentheses by only using close parentheses, where the number of parentheses indicates the branch length. In addition, DeepSMILES avoids the problem of pairing ring closure symbols by using only a single symbol at the ring closing location, where the symbol indicates the ring size. We show that this syntax can be interconverted to/from SMILES with string processing without any loss of information, including stereo configuration.
We believe that DeepSMILES will be useful, not just for those using SMILES in deep neural networks, but also for other computational methods that use SMILES as the basis for generating molecular structures such as genetic algorithms.