In this article, we continue our exploration of the peptide binding problem described in Peptide Binding – Part 1: 1D Convolutional Neural Network. We will use recurrent neural networks (RNNs) with three kinds of recurrent units: simple, Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM). We will also use all three types of RNN units in bidirectional mode, in which the RNN reads each input sequence from left to right and then from right to left. We will see that single-layer, unregularized, unoptimized RNNs produce better results than the comparable 1D CNN.
2. Recurrent Neural Networks
The basic idea behind recurrent neural networks is to exploit the structure of the input data and transform it into a latent space that many different types of machine learning algorithms can use to extract insights. Like CNNs, RNNs are thus fundamentally feature extractors, selectors, and transformers. RNNs look at the entirety of an input data point and feed the output of the RNN units back in alongside the next input, which enables a form of memory. This type of algorithm is therefore well suited to data in which order matters. For our problem, the sequences of amino acids that make up peptides clearly have an ordered structure, so we expect RNNs to perform well on our classification task.
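To make the feedback idea concrete, here is a minimal pure-Python sketch of a simple recurrent unit and of bidirectional mode. It is an illustration under stated assumptions, not the Keras implementation: the function names (`rnn_step`, `run_rnn`, `run_bidirectional`), the tiny weight values, and the toy input are all made up for this example, and the gating machinery of GRU and LSTM units is left out. The two directions are combined by concatenation, which is the default merge mode of the Keras `Bidirectional` wrapper.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_x[j], x_t))       # contribution of the current input
            + sum(w * h for w, h in zip(W_h[j], h_prev))  # feedback from the previous state
            + b[j]
        )
        for j in range(len(b))
    ]


def run_rnn(inputs, W_x, W_h, b):
    """Read the sequence in order and return the final hidden state."""
    h = [0.0] * len(b)  # the "memory" starts empty
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


def run_bidirectional(inputs, fwd_params, bwd_params):
    """Read left to right and right to left, then concatenate both final states."""
    h_fwd = run_rnn(inputs, *fwd_params)
    h_bwd = run_rnn(list(reversed(inputs)), *bwd_params)
    return h_fwd + h_bwd


# Hypothetical toy setup: 2 hidden units, 1-dimensional inputs.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]
sequence = [[1.0], [2.0], [3.0]]

h_forward = run_rnn(sequence, W_x, W_h, b)
h_reversed = run_rnn(list(reversed(sequence)), W_x, W_h, b)

# For brevity both directions share one set of weights here; a trained
# bidirectional layer learns separate weights for each direction.
h_bi = run_bidirectional(sequence, (W_x, W_h, b), (W_x, W_h, b))

print(h_forward)
print(h_reversed)
print(len(h_bi))
```

Because the hidden state is carried from step to step, feeding the same values in a different order yields a different final state, which is the sense in which order matters to an RNN.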
Below are some resources for more detailed information about RNNs.
- Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network – Alex Sherstinsky
- Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs – Denny Britz
- Understanding LSTM Networks – colah’s blog
- Recurrent neural networks and LSTM tutorial in Python and TensorFlow – Adventures in Machine Learning
3. Data & Code
See Peptide Binding – Part 1: 1D Convolutional Neural Network for information about the raw data and the code for transforming the data into formats suitable for the RNN. We use the same transformations here. As a brief refresher, the raw features are sequences of amino acids, in which the amino acids are represented by letters and the sequences are then words. Each letter is assigned an integer, so that LLTDAQRIV = [10, 10, 17, 3, 1, 14, 15, 8, 18], and this integer sequence is then fed into a Keras embedding layer. By using such character-level embedding, we allow the RNN itself to determine internal representations of features.
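The letter-to-integer step can be sketched as follows. The mapping below is an assumption: it numbers the 20 standard amino acids from 1 in alphabetical order of their one-letter codes, which reproduces the LLTDAQRIV example above, but the actual mapping is defined in the Part 1 code. The names `encode_peptide`, `make_embedding_table`, and `embed` are hypothetical, and the lookup table only illustrates the idea of an embedding layer as a table with one trainable vector per integer; reserving index 0 (for example for padding) is also an assumption.

```python
import random

# Assumption: one-letter amino acid codes in alphabetical order, numbered from 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHAR_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_peptide(peptide):
    """Map each amino acid letter to its integer index."""
    return [CHAR_TO_INT[aa] for aa in peptide]


def make_embedding_table(vocab_size, embedding_dimension, seed=0):
    """Build a randomly initialised table; a real embedding layer learns these values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]
        for _ in range(vocab_size)
    ]


def embed(indices, table):
    """Look up one vector per integer, turning a peptide into a sequence of vectors."""
    return [table[i] for i in indices]


encoded = encode_peptide("LLTDAQRIV")
print(encoded)  # [10, 10, 17, 3, 1, 14, 15, 8, 18]

# Index 0 is left unused here (assumed reserved), hence 20 + 1 rows.
table = make_embedding_table(vocab_size=len(AMINO_ACIDS) + 1, embedding_dimension=4)
embedded = embed(encoded, table)
print(len(embedded), len(embedded[0]))  # 9 4
```

Identical letters always map to the same vector, so the two leading L residues share one row of the table; what distinguishes them for the RNN is their position in the sequence.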
The targets comprise 3 classes, which are one-hot encoded: NB (Non Binder) = 0 = [1., 0., 0.], WB (Weak Binder) = 1 = [0., 1., 0.], and SB (Strong Binder) = 2 = [0., 0., 1.].
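As a small illustration of this target encoding, the sketch below builds the one-hot vectors by hand. The names `CLASS_INDEX` and `one_hot` are hypothetical; in a Keras workflow the same result is commonly obtained with `keras.utils.to_categorical` applied to the integer labels.

```python
# Class labels and their integer indices, as listed above.
CLASS_INDEX = {"NB": 0, "WB": 1, "SB": 2}


def one_hot(label, num_classes=3):
    """Return a vector with 1.0 at the class index and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[CLASS_INDEX[label]] = 1.0
    return vec


targets = [one_hot(label) for label in ["NB", "WB", "SB"]]
print(targets)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

This format matches a softmax output layer with one node per class trained with categorical cross-entropy.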
We used 150 epochs with a batch size of 128. There is no early stopping, no regularization, and no attempt at a parameter search. The results here establish a baseline; a later article will enhance the models with early stopping, etc. Despite these limitations, overfitting is not too egregious, as can be seen in the validation loss and accuracy curves below, so we also show confusion matrices and accuracies.
    from keras.models import Sequential
    from keras.layers import Dense, Embedding
    from keras.layers import SimpleRNN, GRU, LSTM, Bidirectional

    def rnn_one_layer(xtrainraw, ytrain, xtestraw, ytest, ...):

        # create the rnn
        num_output_nodes = num_classes
        layer_1_units = 32
        model_rnn = Sequential()
        model_rnn.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))

        # add a single recurrent layer, optionally wrapped in Bidirectional
        if rnn_type == 'SimpleRNN':
            if bidirectional_flag:
                model_rnn.add(Bidirectional(SimpleRNN(layer_1_units)))
            else:
                model_rnn.add(SimpleRNN(layer_1_units))
        elif rnn_type == 'GRU':
            if bidirectional_flag:
                model_rnn.add(Bidirectional(GRU(layer_1_units)))
            else:
                model_rnn.add(GRU(layer_1_units))
        elif rnn_type == 'LSTM':
            if bidirectional_flag:
                model_rnn.add(Bidirectional(LSTM(layer_1_units)))
            else:
                model_rnn.add(LSTM(layer_1_units))
        else:
            raise ValueError('invalid rnn_type in rnn_one_layer(): ' + str(rnn_type))

        # softmax output layer, one node per class
        model_rnn.add(Dense(num_output_nodes, activation='softmax'))

        # compile the model
        model_rnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

        # summarize the model (summary() prints the table itself)
        print('\n** results for', model_name)
        model_rnn.summary()

        # fit the model
        h = model_rnn.fit(xtrain_rnn, ytrain, epochs=num_epochs, batch_size=batchsize,
                          validation_data=(xtest_rnn, ytest), verbose=verbosity)
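Once a model is fit, a confusion matrix and an accuracy can be computed from the one-hot targets and the softmax outputs. The sketch below shows one way to do this in plain Python; it is not the code used to produce the results in this article. The function names and the four toy predictions are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def argmax(values):
    """Index of the largest entry, i.e. the predicted (or true) class."""
    return max(range(len(values)), key=lambda i: values[i])


def confusion_matrix(y_true_onehot, y_pred_probs, num_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_vec, pred_vec in zip(y_true_onehot, y_pred_probs):
        matrix[argmax(true_vec)][argmax(pred_vec)] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical toy data: 4 peptides with classes NB=0, WB=1, SB=2.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
y_pred = [
    [0.8, 0.1, 0.1],  # predicted NB, correct
    [0.2, 0.5, 0.3],  # predicted WB, correct
    [0.1, 0.6, 0.3],  # predicted WB, but the true class is SB
    [0.1, 0.2, 0.7],  # predicted SB, correct
]

cm = confusion_matrix(y_true, y_pred)
print(cm)            # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(accuracy(cm))  # 0.75
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy figure hides.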
The RNN results are superior to those of the 1D CNN, 91.62% accuracy, and are comparable to the results reported in the original article: 94.8% for an MLP, 92% for a 2D CNN, and 82.2% for a random forest. Recall that the paper used additional biological information to enhance the data, but that no optimization or regularization was attempted there either.