In this paper, the authors adapted a seq2seq RNN originally built for language translation into a seq2seq RNN autoencoder (with SMILES strings as both input and output), so that the resulting latent space could serve as molecular fingerprints for subsequent machine learning tasks. Note that, unlike many such projects, their goal was not to build a generative variational autoencoder for creating new molecules. Rather, their goal was to train the autoencoder on a large database of unlabeled SMILES and then use the learned latent space to encode small, labeled SMILES data sets.
There are 3 main ideas incorporated in their model, as the authors describe them:
Attention Mechanism: So far, the only connection between the perceiver and interpreter networks is the shared hidden memory. As the sequence grows longer, it becomes extremely challenging to pass information from the perceiver to the interpreter network through the hidden memory alone. To address this issue, an attention mechanism is employed to establish a stronger connection and provide soft-alignment between the perceiver and interpreter networks.
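To make the soft-alignment idea concrete, here is a minimal sketch of a single attention step in plain Python. It assumes dot-product scoring for brevity, which may differ from the scoring function used in the authors' implementation. The names attend, encoder_states, and decoder_state are illustrative and do not come from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """One attention step (assumed dot-product scoring).

    Returns the alignment weights over encoder positions and the
    context vector, i.e. the weighted sum of encoder states.
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy values: three encoder (perceiver) states, one decoder (interpreter) state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

The decoder therefore reads from every encoder position at each step, weighted by relevance, instead of relying only on the final hidden memory.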
Dropout Layer: One of the most favorable features of our model is its capability to use nearly unlimited molecular training data. However, over-fitting will come into play if we grow our data without restriction. To enhance the generalizability of our model, we add a dropout layer to each input and output gate, but we do not add dropout to the hidden memory transferring gate.
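The dropout placement can be sketched as follows. This is an illustration under assumptions, not the authors' code: it reads "input, output gate" as the cell's input and output connections, and it uses a trivial averaging function (toy_cell, a hypothetical stand-in) in place of a real GRU so that the example stays self-contained.

```python
import random


def dropout(vec, keep_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob
    and rescale the survivors by 1 / keep_prob."""
    if not training or keep_prob >= 1.0:
        return list(vec)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in vec]


def toy_cell(x, h):
    """Hypothetical stand-in for a GRU cell (NOT a real GRU)."""
    return [0.5 * (xi + hi) for xi, hi in zip(x, h)]


def step(x, h, keep_prob, rng, training=True):
    """One recurrent step with dropout on the input and the output only."""
    x_dropped = dropout(x, keep_prob, rng, training)  # input connection
    h_next = toy_cell(x_dropped, h)                   # hidden memory: no dropout
    y = dropout(h_next, keep_prob, rng, training)     # output connection
    return y, h_next


rng = random.Random(0)
y, h_next = step([1.0, 2.0], [3.0, 4.0], keep_prob=0.5, rng=rng)
```

Note that h_next, the state carried to the next time step, is never passed through dropout, so the recurrent memory stays intact across the sequence.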
While we improve the original sequence-to-sequence model in several respects, we keep using the previously introduced reverse technique, where the source sequence is mapped to the reverse sequence of the target.
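For an autoencoder, the reverse technique amounts to building training pairs in which one side is the reversed token sequence of the other. The sketch below is an assumption-laden illustration: it tokenizes SMILES character by character (so multi-character atoms such as Cl would be split) and reverses the source side. The helper name make_training_pair is hypothetical.

```python
def make_training_pair(smiles):
    """Build a (source, target) pair for a reversed-source autoencoder.

    Assumes character-level tokenization. The reversed source is only
    a token sequence fed to the encoder; it need not be valid SMILES.
    """
    tokens = list(smiles)
    source = tokens[::-1]  # encoder input: reversed sequence
    target = tokens        # decoder output: original sequence
    return source, target


# Acetic acid as a toy example.
source, target = make_training_pair("CC(=O)O")
```

With this arrangement, the first tokens the decoder must produce are the last ones the encoder has read, which shortens the distance between corresponding tokens at the start of decoding.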
Code can be found at: https://github.com/XericZephyr/seq2seq-fingerprint.
There is a modified implementation (no attention mechanism; variational autoencoder functionality added) of this model in DeepChem:
Below is the abstract of Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery.
Many of today’s drug discoveries require expertise knowledge and insanely expensive biological experiments for identifying the chemical molecular properties. However, despite the growing interests of using supervised machine learning algorithms to automatically identify those chemical molecular properties, there is little advancement of the performance and accuracy due to the limited amount of training data. In this paper, we propose a novel unsupervised molecular embedding method, providing a continuous feature vector for each molecule to perform further tasks, e.g., solubility classification. In the proposed method, a multi-layered Gated Recurrent Unit (GRU) network is used to map the input molecule into a continuous feature vector of fixed dimensionality, and then another deep GRU network is employed to decode the continuous vector back to the original molecule. As a result, the continuous encoding vector is expected to contain rigorous and enough information to recover the original molecule and predict its chemical properties. The proposed embedding method could utilize almost unlimited molecule data for the training phase. With sufficient information encoded in the vector, the proposed method is also robust and task-insensitive. The performance and robustness are confirmed and interpreted in our extensive experiments.