Peptide Binding – Part 2: Recurrent Neural Networks

1. Introduction

In this article, we continue our exploration of the peptide binding problem described in Peptide Binding – Part 1: 1D Convolutional Neural Network. We will use recurrent neural networks (RNNs) with three kinds of recurrent units: simple, Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM). We will also use all three unit types in bidirectional mode, in which the RNN reads each input sequence both from left to right and from right to left. We will see that single-layer, unregularized, unoptimized RNNs produce better results than the comparable 1D CNN.

2. Recurrent Neural Networks

The basic idea behind recurrent neural networks is to exploit the structure of the input data and transform it into a latent space that many different types of machine learning algorithms can then use to extract insights. RNNs are thus, like CNNs, fundamentally feature extractors, selectors, and transformers. An RNN processes the entirety of an input data point and feeds the output of its recurrent units back in alongside the next input, which gives it a form of memory. This makes the algorithm well suited to data in which order matters. The sequences of amino acids that make up peptides are clearly ordered, so we expect RNNs to perform well on our classification task.
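To make the feedback idea concrete, here is a minimal sketch of a simple recurrent unit in plain Python. It is an illustration only, not the Keras implementation: the function names, the toy weights, and the toy sequence are all made up for this example. The bidirectional helper concatenates a left-to-right pass with a right-to-left pass, which is the default merge mode of the Keras Bidirectional wrapper. In Keras the two directions have separate weights; the sketch reuses one set for brevity.

```python
import math

def simple_rnn(inputs, w_in, w_rec, bias):
    """Run a tanh recurrent unit over a sequence; return the final hidden state.

    inputs: list of input vectors, one per time step
    w_in:   units x input_dim weights applied to the current input
    w_rec:  units x units weights applied to the previous hidden state
    bias:   list with one bias per unit
    """
    units = len(bias)
    hidden = [0.0] * units  # the "memory" starts out empty
    for x in inputs:
        new_hidden = []
        for i in range(units):
            total = bias[i]
            total += sum(w * xj for w, xj in zip(w_in[i], x))
            # feedback: the previous hidden state influences the new one
            total += sum(w * hj for w, hj in zip(w_rec[i], hidden))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden

def bidirectional(inputs, forward_params, backward_params):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    return (simple_rnn(inputs, *forward_params)
            + simple_rnn(inputs[::-1], *backward_params))

# Toy setup (arbitrary values): 2 hidden units, 2-dimensional inputs.
params = ([[0.5, -0.3], [0.1, 0.8]],   # w_in
          [[0.2, 0.0], [-0.4, 0.6]],   # w_rec
          [0.0, 0.1])                  # bias
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h_forward = simple_rnn(seq, *params)
h_reversed = simple_rnn(seq[::-1], *params)
h_both = bidirectional(seq, params, params)
print(h_forward, h_reversed, h_both)
```

Feeding the same elements in reverse order yields a different final state, which is exactly the order sensitivity we want for amino acid sequences.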

Below are some resources for more detailed information about RNNs.

Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network – Alex Sherstinsky
Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs – Denny Britz
Understanding LSTM Networks – colah’s blog
Recurrent neural networks and LSTM tutorial in Python and TensorFlow – Adventures in Machine Learning

3. Data & Code

See Peptide Binding – Part 1: 1D Convolutional Neural Network for information about the raw data and the code that transforms it into formats suitable for the RNN. We use the same transformations here. As a brief refresher, the raw features are sequences of amino acids, in which each amino acid is represented by a letter, so that each sequence forms a word. Each letter is then assigned an integer, so that LLTDAQRIV becomes [10, 10, 17, 3, 1, 14, 15, 8, 18], and this integer sequence is fed into a Keras embedding layer. By using such character-level embedding, we allow the RNN itself to determine internal representations of features.
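As a rough sketch of this encoding step, the snippet below numbers the 20 standard amino acid letters alphabetically starting at 1, which reproduces the example above. The names AMINO_ACIDS, CHAR_TO_INT, and encode_peptide are hypothetical. The zero padding on the right is an assumption too, since the preprocessing code from Part 1 may pad differently.

```python
# 20 standard amino acids in alphabetical order (assumed ordering;
# it reproduces the LLTDAQRIV example from the text).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Start numbering at 1 so that 0 stays free as a padding value.
CHAR_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}

def encode_peptide(peptide, maximum_length=None):
    """Map each amino acid letter to its integer code.

    If maximum_length is given, shorter peptides are right-padded with
    zeros (an assumed convention, not taken from the original code).
    """
    codes = [CHAR_TO_INT[aa] for aa in peptide]
    if maximum_length is not None:
        codes = codes + [0] * (maximum_length - len(codes))
    return codes

encoded = encode_peptide("LLTDAQRIV")
print(encoded)  # [10, 10, 17, 3, 1, 14, 15, 8, 18]
```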

The targets comprise 3 classes, which are one-hot encoded: NB (Non Binder) = 0 = [1., 0., 0.], WB (Weak Binder) = 1 = [0., 1., 0.], and SB (Strong Binder) = 2 = [0., 0., 1.].
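The one-hot encoding can be sketched in a few lines of plain Python. The names CLASS_LABELS and one_hot are made up for this illustration; in a Keras workflow the to_categorical utility does the same job.

```python
# Class names and integer labels as given in the text.
CLASS_LABELS = {"NB": 0, "WB": 1, "SB": 2}

def one_hot(class_index, num_classes=3):
    """Return a vector of zeros with a single 1.0 at class_index."""
    vector = [0.0] * num_classes
    vector[class_index] = 1.0
    return vector

targets = [one_hot(CLASS_LABELS[name]) for name in ("NB", "WB", "SB")]
print(targets)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```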

We trained for 150 epochs with a batch size of 128. There is no early stopping, no regularization, and no parameter search. The results here establish a baseline for a later article, in which we will enhance the models with early stopping, etc. Despite these limitations, overfitting is not too egregious, as can be seen in the validation loss and accuracy curves below, so we also show confusion matrices and accuracies.

from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import SimpleRNN, GRU, LSTM, Bidirectional

def rnn_one_layer(xtrainraw, ytrain, xtestraw, ytest, ...):
    # create the rnn
    num_output_nodes = num_classes
    layer_1_units = 32
    model_rnn = Sequential()
    model_rnn.add(Embedding(vocab_size, embedding_dimension, 
                           input_length=maximum_length))
    if rnn_type == 'SimpleRNN':
        if bidirectional_flag:
            model_rnn.add(Bidirectional(SimpleRNN(layer_1_units)))
        else:
            model_rnn.add(SimpleRNN(layer_1_units))
    elif rnn_type == 'GRU':
        if bidirectional_flag:
            model_rnn.add(Bidirectional(GRU(layer_1_units)))
        else:
            model_rnn.add(GRU(layer_1_units))
    elif rnn_type == 'LSTM':
        if bidirectional_flag:
            model_rnn.add(Bidirectional(LSTM(layer_1_units)))
        else:
            model_rnn.add(LSTM(layer_1_units))
    else:
        # fail loudly on an unsupported recurrent unit type
        raise ValueError('invalid rnn_type in rnn_one_layer(): ' + str(rnn_type))
        
    model_rnn.add(Dense(num_output_nodes, activation='softmax'))
    # compile the model
    model_rnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    # summarize the model (summary() prints the table itself and returns None)
    print('\n** results for',model_name)
    model_rnn.summary()
    # fit the model
    h = model_rnn.fit(xtrain_rnn, ytrain, epochs=num_epochs, 
                            batch_size=batchsize,
                            validation_data=(xtest_rnn, ytest), 
                            verbose=verbosity)

4. Results

Graphs: Validation Loss, Validation Accuracy, Confusion Matrices, Normalized Confusion Matrices

Model	Accuracy	Balanced Accuracy
SimpleRNN	0.9267	0.9269
Bidirectional SimpleRNN	0.9212	0.9215
GRU	0.9414	0.9417
Bidirectional GRU	0.9259	0.9262
LSTM	0.9229	0.9232
Bidirectional LSTM	0.9208	0.9211
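For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to derive them from a confusion matrix. It assumes rows are true classes and columns are predicted classes, and it takes balanced accuracy to be the mean of the per-class recalls (the definition used by scikit-learn). The article's own evaluation code may differ, and the counts here are made up purely for illustration.

```python
def accuracy(confusion):
    """Fraction of all samples on the diagonal (correctly classified)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def balanced_accuracy(confusion):
    """Mean of per-class recall, so every class counts equally."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(recalls) / len(recalls)

# Made-up counts: rows = true NB, WB, SB; columns = predicted NB, WB, SB.
toy_confusion = [
    [90, 5, 5],
    [10, 30, 10],
    [0, 5, 45],
]

acc = accuracy(toy_confusion)
bal_acc = balanced_accuracy(toy_confusion)
print(round(acc, 4), round(bal_acc, 4))
```

When the classes are roughly balanced, the two numbers end up close to each other, as they do in the table above.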

All six RNN variants outperform the 1D CNN (91.62% accuracy) and are comparable to the results reported in the original article: 94.8% for an MLP, 92% for a 2D CNN, and 82.2% for a random forest. Recall that the paper used additional biological information to enhance the data, but it also attempted no optimization or regularization.