Machine Learning Papers: Molecules
A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility - Tang et al 2020
DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures - O'Boyle and Dalke 2018
There has been increasing interest in the use of deep neural networks for de novo design of molecules with desired properties. A common approach is to train a generative model on SMILES strings and then use this to generate SMILES strings for molecules with a desired property. Unfortunately, these SMILES strings are often not syntactically valid due to elements of SMILES syntax that must occur in pairs.
We describe a SMILES-like syntax called DeepSMILES that addresses two of the main reasons for invalid syntax when using a probabilistic model to generate SMILES strings. The DeepSMILES syntax avoids the problem of unbalanced parentheses by only using close parentheses, where the number of parentheses indicates the branch length. In addition, DeepSMILES avoids the problem of pairing ring closure symbols by using only a single symbol at the ring closing location, where the symbol indicates the ring size. We show that this syntax can be interconverted to/from SMILES with string processing without any loss of information, including stereo configuration.
We believe that DeepSMILES will be useful, not just for those using SMILES in deep neural networks, but also for other computational methods that use SMILES as the basis for generating molecular structures such as genetic algorithms.
A de novo molecular generation method using latent vector based generative adversarial network - Prykhodko et al. 2019
While the development of new drugs is costly, time consuming, and often accompanied with safety issues, drug repurposing, where old drugs with established safety are used for medical conditions other than originally developed, is an attractive alternative. Then, how the old drugs work on new targets becomes a crucial part of drug repurposing and gains much of interest. Several statistical and machine learning models have been proposed to estimate drug–target binding affinity and deep learning approaches have been shown to be among state-of-the-art methods. However, drugs and targets in these models have been commonly represented in 1D strings, regardless the fact that molecules are by nature formed by the chemical bonding of atoms.
In this work, we propose GraphDTA to capture the structural information of drugs, possibly enhancing the predictive power of the affinity. In particular, unlike competing methods, drugs are represented as graphs and graph convolutional networks are used to learn drug–target binding affinity. We trial our method on two benchmark datasets of drug–target binding affinity and compare the performance with state-of-the-art models in the research field.
Convolutional neural network based on SMILES representation of compounds for detecting chemical motif - Hirohara et al 2018
Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features.
We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected.