Peptide Binding - Part 1: 1D Convolutional Neural Network

1. Introduction

In Deep Learning for Cancer Immunotherapy, Leon Eyrich Jessen used various machine learning algorithms to classify peptide binding strength into three classes, ‘strong binder’ SB, ‘weak binder’ WB or ‘non-binder’ NB. The author used 9-mer peptides, sequences of 9 amino acids, as raw input data. This data was created and transformed as follows:

The input data for this use case was created by generating 1,000,000 random 9-mer peptides by sampling the one-letter code for the 20 amino acids, i.e. ARNDCQEGHILKMFPSTWYV, and then submitting the peptides to MHCI binding prediction using the current state-of-the-art model netMHCpan. Different variants of MHCI exists, so for this case we chose HLA-A*02:01.

… each peptide will be encoded by assigning a vector of 20 values, where each value is the probability of the amino acid mutating into 1 of the 20 others as defined by the BLOSUM62 matrix using the pep_encode() function from the PepTools package. This way each peptide is converted to an ‘image’ matrix with 9 rows and 20 columns.

This is then modeling a model.

After converting the data in this way, it was scaled to values between 0 and 1. Then multilayer perceptron (MLP), 2D convolutional neural network, and random forest algorithms were used (the 2D data was flattened for the MLP and random forest).

See the article for results and discussion about the biology behind the problem.

Our article is the first in a series, where we will explore this problem via a variety of methods. We begin with a 1D convolutional neural network.

2. Data

The original raw data can be found here. In case this link dies, we provide a zipped csv file.

Below is a snippet of the data:

peptide	label_chr	label_num	data_type
LLTDAQRIV	WB	1	train
LMAFYLYEV	SB	2	train
VMSPITLPT	WB	1	test
SLHLTNCFV	WB	1	train
RQFTCMIAV	WB	1	train

The following code deletes the label_chr column, and divides the data into calibration (which is then divided into train and test) and production, with the latter serving as new, unseen data.

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

def create_data():
    # get raw data
    dfraw = pd.read_csv(data_dir + 'raw_data.csv')
    dfraw.drop(columns=['label_chr'], inplace=True)
    
    # split into calibration (train) and production (test)
    type_group = dfraw.groupby('data_type')
    for group_name, dfgroup in type_group:
        dfgroup.drop(columns=['data_type'], inplace=True)
        if group_name == 'train':
            dfgroup = shuffle(dfgroup)
            dfgroup.to_pickle(data_dir + 'df_calibration.pkl')
            # header = ['peptide', 'label_num']
            print('calibration data size:',dfgroup.shape)
            # split calib into train/test (80/20)       
            df_train, df_test = train_test_split(dfgroup, train_size=0.8, shuffle=True)
            df_train.to_pickle(data_dir + 'df_train.pkl')
            print('train data size:',df_train.shape)
            df_test.to_pickle(data_dir + 'df_test.pkl')
            print('test data size:',df_test.shape)
        elif group_name == 'test':
            dfgroup.to_pickle(data_dir + 'df_production.pkl')
            print('production data size:',dfgroup.shape)

The output is:

production data size: (2376, 2)
calibration data size: (21384, 2)
train data size: (17107, 2)
test data size: (4277, 2)

Peptides are represented as strings of letters. For this problem, the alphabet is restricted to the 20 letters that are abbreviations for amino acids directly encoded by genes in humans.

Image from: https://i2.wp.com/www.compoundchem.com/wp-content/uploads/2014/09/20-Common-Amino-Acids-v3.png

We represent peptides via character level embedding. First, letters are assigned unique integers using the following code:

import numpy as np

def encode_sequence(sequence, sequence_type, encoding_type):
    
    if sequence_type == 'dna':
        letters = 'ACGT'
    elif sequence_type == 'amino_acid_20':
        letters = 'ACDEFGHIKLMNPQRSTVWY'
    elif sequence_type == 'amino_acid_22':
        letters = 'ACDEFGHIKLMNOPQRSTUVWY'
    else:
        print('\ninvalid sequence_type in encode_sequence()',sequence_type)
        print('in encode_biological_sequence.py')
        raise NameError
        
    num_letters = len(letters)
    
    if encoding_type == 'one_hot':    
        mapping = dict(zip(letters, range(num_letters)))    
        sequence_mapped = [mapping[i] for i in sequence]
        transformed_sequence =  np.eye(num_letters)[sequence_mapped]
    elif encoding_type == 'decimal_vector':
        mapping = dict(zip(letters, np.linspace(1/num_letters,1,
                                                num=num_letters)))
        transformed_sequence = [mapping[i] for i in sequence]
    elif encoding_type == 'embedding':
        # reserve 0 for padding
        mapping = dict(zip(letters, range(1, num_letters+1)))
        transformed_sequence = [mapping[i] for i in sequence]
    else:
        print('\ninvalid encoding_type in encode_sequence()',encoding_type)
        print('in encode_biological_sequence.py')
        raise NameError
            
    return transformed_sequence

For example, this code yields:
LLTDAQRIV = [10, 10, 17, 3, 1, 14, 15, 8, 18]
LMAFYLYEV = [10, 11, 1, 5, 20, 10, 20, 4, 18]
VMSPITLPT = [18, 11, 16, 13, 8, 17, 10, 13, 17]

In the following piece of code, the above function is encode_biological_sequence.encode_sequence(), which performs the transformation above. Then using the reserved integer 0, the integer representations is padded out if the peptide lengths are different (0s are added to the end). Finally, the Keras embedding layer is used so that the CNN can learn an internal representation of the peptides as part of the overall learning process of assigning peptide binding strengths. Thus feature importances are computed and utilized automatically.

from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences
import encode_biological_sequence

def cnn_one_layer(xtrainraw, ... ):
    embedding_dimension = 32
    vocab_size = 50  # > number of possible unique words
    maximum_length = 0
    if type_of_encoding == 'embedding':
        # integer encode the peptides
                
        peptides_integer_encoded_train = []
        for pep in xtrainraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_train.append(pie)
        
        peptides_integer_encoded_test = []
        for pep in xtestraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_test.append(pie)
        
        peptides_integer_encoded_prod = []
        for pep in xprodraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_prod.append(pie)
                
        
        # add padding so that all have the same length
        max_length_train = max(len(p) for p in peptides_integer_encoded_train)
        max_length_test = max(len(p) for p in peptides_integer_encoded_test)
        maximum_length = max(max_length_train, max_length_test)
        assert(vocab_size > maximum_length)
        
        xtrain_cnn = pad_sequences(peptides_integer_encoded_train, 
                                   maxlen=maximum_length, padding='post')
        xtest_cnn = pad_sequences(peptides_integer_encoded_test, 
                                  maxlen=maximum_length, padding='post')
        xprod_cnn = pad_sequences(peptides_integer_encoded_prod, 
                                  maxlen=maximum_length, padding='post')
    else:
        raise NameError
                  
    # create the cnn
    model_cnn.add(Embedding(vocab_size, embedding_dimension, 
                           input_length=maximum_length))

The variables xtrainraw, xtestraw, xprodraw are the peptides represented as strings, like ‘LLTDAQRIV’.

The embedding dimension was determined arbitrarily. It will be optimized in a later article.

The targets are binding affinity with T cell receptors which are represented by the three classes: NB (Non Binder) = 0, WB (Weak Binder) = 1, SB (Strong Binder) = 2. We then use one hot encoding:

from keras.utils.np_utils import to_categorical
y_train = to_categorical(df_train['label_num'].values)

so that 0 = [1., 0., 0.], 1 = [0., 1., 0.], and 2 = [0., 0., 1.].

3. Simple 1D CNN

We use a simple 1d CNN for this classification problem. The architecture is simple, there is no regularization, or attempt to optimize any parameter. The goal here is to establish a baseline model. Enhancements will be applied in a later article.

from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import Conv1D, MaxPooling1D, Flatten

def cnn_one_layer(xtrainraw, ... ):
    num_output_nodes = num_classes
    layer_1_units = 32
    model_cnn = Sequential()
    model_cnn.add(Embedding(vocab_size, embedding_dimension, 
                           input_length=maximum_length))
    model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
    model_cnn.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn.add(Flatten())
    model_cnn.add(Dense(num_output_nodes, activation='softmax'))
    # compile the model
    model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    # summarize the model
    print('\n** results for simple cnn:')
    print(model_cnn.summary())
    # fit the model
    h = model_cnn.fit(xtrain_cnn, ytrain, epochs=num_epochs, 
                            batch_size=batchsize,
                            validation_data=(xtest_cnn, ytest), 
                            verbose=verbosity)

There are 3 classes which determines the size of the output layer as we are using one hot encoding of the classes. For the kernel size in the convolutional layer, we tried 5 and 7 with 7 yielding the best result. Also, the pool size was set to 3.

Here is a view of the CNN:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 9, 32) 1600
_________________________________________________________________
conv1d_2 (Conv1D) (None, 3, 32) 7200
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 1, 32) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 32) 0
_________________________________________________________________
dense_2 (Dense) (None, 3) 99
=================================================================
Total params: 8,899
Trainable params: 8,899
Non-trainable params: 0
_________________________________________________________________

Here are plots of the loss and accuracy (note that the classes are well balanced):

4. Simple 1D CNN With Cross Validation And Early Stopping

We use the same CNN as above and implement 5 fold cross validation. Additionally, based on the loss and accuracy curves, we add early stopping, so that training is stopped if the validation loss does not decrease in 10 epochs. Since we are not optimizing parameters, we use the maximum number of epochs for the folds to run the model again and use it to predict the classification of production (unseen) data.

    metric = 'acc'
    model_file = model_name + '.h5'
    callbacks_list = [
            EarlyStopping(monitor='val_loss', patience=stop_epochs),
            ModelCheckpoint(filepath=model_file, monitor='val_loss', save_best_only=True)]
    num_output_nodes = num_classes
    layer_1_units = 32
    
    list_cv_metrics = []
    list_cv_losses = []
    list_cv_epochs = []
    cv_num = int(xcalib_cnn.shape[0]/folds)
    for fold in range(folds):
        # create the cnn
        model_cnn = Sequential()
        model_cnn.add(Embedding(vocab_size, embedding_dimension, 
                               input_length=maximum_length))
        model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, 
                             activation='relu'))
        model_cnn.add(MaxPooling1D(pool_size=pooling_size))
        model_cnn.add(Flatten())
        model_cnn.add(Dense(num_output_nodes, activation='softmax'))
        # compile the model
        model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', 
                          metrics=[metric])
        
        # get train/valid
        x_train = np.vstack((xcalib_cnn[:cv_num*fold], xcalib_cnn[cv_num*(fold+1):]))
        y_train = np.vstack((ycalib[:cv_num*fold], ycalib[cv_num*(fold+1):]))
        
        x_valid = xcalib_cnn[cv_num*fold:cv_num*(fold+1)]
        y_valid = ycalib[cv_num*fold:cv_num*(fold+1)]
        
        # fit the model
        h = model_cnn.fit(x_train, y_train, 
                          epochs=num_epochs, 
                          batch_size=batchsize,
                          validation_data=(x_valid, y_valid), 
                          callbacks=callbacks_list,
                          verbose=verbosity)
        
        # collect cv stats
        cv_loss, cv_metric = model_cnn.evaluate(x_valid, y_valid, verbose=0)
        list_cv_metrics.append(cv_metric)
        list_cv_losses.append(cv_loss)
        list_cv_epochs.append(len(h.history['val_loss']))

    # redo the model with the maximum epochs from the cv and all calib data, no early stopping
    model_cnn_final = Sequential()
    model_cnn_final.add(Embedding(vocab_size, embedding_dimension, 
                           input_length=maximum_length))
    model_cnn_final.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, 
                         activation='relu'))
    model_cnn_final.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn_final.add(Flatten())
    model_cnn_final.add(Dense(num_output_nodes, activation='softmax'))
    # compile the model
    model_cnn_final.compile(optimizer='adam', loss='categorical_crossentropy', 
                            metrics=[metric])
    # summarize the model
    print('\n** final results for',model_name)
    # fit the model
    h = model_cnn_final.fit(xcalib_cnn, ycalib, 
                      epochs=np.max(list_cv_epochs), 
                      batch_size=batchsize,
                      verbose=verbosity)
    
    y_prod = np.argmax(yprod, axis=1)
    y_prod_predict = np.argmax(model_cnn_final.predict(xprod_cnn), axis=1)

Note that by using one hot encoding of the classes and softmax in the output layer, the predictions are class probabilities. These probabilities are then set to a definite class via plurality voting. This is done in the last two lines of the code.

Here are the results of the final 1d CNN:

Cross validation metrics:

Fold	Accuracy	Loss	Epochs
1	0.90388	0.24380	49
2	0.90341	0.23927	45
3	0.90598	0.23742	42
4	0.90598	0.22416	50
5	0.90739	0.21724	45

max cross validation metric = 0.90739
min cross validation metric = 0.90341
mean cross validation error = 0.90533

production metrics: balanced accuracy = 0.91645
production metrics: accuracy = 0.91624

Production classification report:

	Precision	Recall	F1 Score	Support
0	0.98	0.97	0.97	782
1	0.89	0.86	0.87	792
2	0.88	0.92	0.90	802
Accuracy			0.92	2376
Macro Average	0.92	0.92	0.92	2376
Weighted Average	0.92	0.92	0.92	2376

Our final result of 91.62% compares favorably with the results reported in the article, 94.8% for a MLP, 92% for a 2d CNN, and 82.2% for random forest. Recall that the paper used additional biological information to enhance the data, but also no optimization or regularization was attempted. Unfortunately, the article only shows confusion matrices graphically without numbers, so we cannot compare them with our results.

5. Code

Below are listings of the codes used in this article.

1D CNN code:

import os
import pandas as pd
import numpy as np
import sys
from pathlib import Path

from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import SimpleRNN, GRU, LSTM, Bidirectional
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint, EarlyStopping

from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

sys.path.append(YOUR PATH)
import confusion_matrix_plot
import train_validation_plots

sys.path.append(YOUR PATH)
import encode_biological_sequence

# *******************************************************************
            
def cnn_one_layer(xtrainraw, ytrain, xtestraw, ytest, num_classes,
                  type_of_sequence, type_of_encoding, embedding_dimension,
                  num_epochs, batchsize,
                  conv_window_size, pooling_size,
                  verbosity, results_dir, class_names,
                  xprodraw, yprod):
    
    model_name = 'CNN_Conv_Window_Size_' + str(conv_window_size) + '_' + type_of_encoding.title()
    
    vocab_size = 50  # > number of possible unique words
    maximum_length = 0
    if type_of_encoding == 'embedding':
        # integer encode the peptides
                
        peptides_integer_encoded_train = []
        for pep in xtrainraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_train.append(pie)
        
        peptides_integer_encoded_test = []
        for pep in xtestraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_test.append(pie)
        
        peptides_integer_encoded_prod = []
        for pep in xprodraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_prod.append(pie)
                        
        # add padding so that all have the same length
        max_length_train = max(len(p) for p in peptides_integer_encoded_train)
        max_length_test = max(len(p) for p in peptides_integer_encoded_test)
        maximum_length = max(max_length_train, max_length_test)
        assert(vocab_size > maximum_length)
        
        xtrain_cnn = pad_sequences(peptides_integer_encoded_train, 
                                   maxlen=maximum_length, padding='post')
        xtest_cnn = pad_sequences(peptides_integer_encoded_test, 
                                  maxlen=maximum_length, padding='post')
        xprod_cnn = pad_sequences(peptides_integer_encoded_prod, 
                                  maxlen=maximum_length, padding='post')
    else:
        raise NameError
                  
    # create the cnn
    num_output_nodes = num_classes
    layer_1_units = 32
    model_cnn = Sequential()
    model_cnn.add(Embedding(vocab_size, embedding_dimension, 
                           input_length=maximum_length))
    model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
    model_cnn.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn.add(Flatten())
    model_cnn.add(Dense(num_output_nodes, activation='softmax'))
    # compile the model
    model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    # summarize the model
    print('\n** results for simple cnn:')
    print(model_cnn.summary())
    # fit the model
    h = model_cnn.fit(xtrain_cnn, ytrain, epochs=num_epochs, 
                            batch_size=batchsize,
                            validation_data=(xtest_cnn, ytest), 
                            verbose=verbosity)
    
    start_epoch=10
    # plot loss
    plot_title = model_name.replace('_',' ') + ' Loss'
    train_validation_plots.plot_train_validation(h.history['loss'], 'Training Loss',  
              h.history['val_loss'], 'Validation Loss',
              start_epoch,
              'Epochs', 'Loss',
              plot_title, results_dir, False)
    
    # plot accuracy
    plot_title = model_name.replace('_',' ') + ' Accuracy'
    train_validation_plots.plot_train_validation(h.history['acc'], 'Training Accuracy',  
              h.history['val_acc'], 'Validation Accuracy',
              start_epoch,
              'Epochs', 'Accuracy',
              plot_title, results_dir, False)
  
    y_test = np.argmax(ytest, axis=1)
    y_test_predict = np.argmax(model_cnn.predict(xtest_cnn), axis=1)
    
    y_prod = np.argmax(yprod, axis=1)
    y_prod_predict = np.argmax(model_cnn.predict(xprod_cnn), axis=1)
       
    # compute and print error metrics to model_name.txt
    stdout_default = sys.stdout
    sys.stdout = open(results_dir + model_name + '.txt','w')
    print(model_cnn.summary(),'\n***\n\n')
    # test metrics
    balanced_accuracy = balanced_accuracy_score(y_test, y_test_predict)
    accuracy = accuracy_score(y_test, y_test_predict, normalize=True)  # fractions
    print('test metrics: balanced accuracy =', balanced_accuracy)
    print('test metrics: accuracy =',accuracy)
    print('test classification report:\n',classification_report(y_test, y_test_predict))
    
    # production metrics
    balanced_accuracy = balanced_accuracy_score(y_prod, y_prod_predict)
    accuracy = accuracy_score(y_prod, y_prod_predict, normalize=True)  # fractions
    print('\n\nproduction metrics: balanced accuracy =', balanced_accuracy)
    print('production metrics: accuracy =',accuracy)
    print('production classification report:\n',classification_report(y_prod, y_prod_predict))
    sys.stdout = stdout_default
       
    # compute confusion matrices for production predictions
    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir,
                                  False, False, 30)  # raw

    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir,
                                  False, True, 30)  # normalized
    
# end of cnn_one_layer()

# *******************************************************************
            
def cnn_one_layer_check_early_stop_cv(xcalibraw, ycalib, num_classes,
                  type_of_sequence, type_of_encoding, embedding_dimension,
                  num_epochs, batchsize,
                  conv_window_size, pooling_size,
                  verbosity, results_dir, class_names,
                  xprodraw, yprod, stop_epochs, folds):
    
    model_name = 'CNN_1_Conv_Window_Size_' + str(conv_window_size) + '_' + type_of_encoding.title()
   
    vocab_size = 50  # > number of possible unique words
    maximum_length = 0
    if type_of_encoding == 'embedding':
        # integer encode the peptides               
        peptides_integer_encoded_calib = []
        for pep in xcalibraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_calib.append(pie)
        
        peptides_integer_encoded_prod = []
        for pep in xprodraw:
            pie = encode_biological_sequence.encode_sequence(pep,
                  type_of_sequence, type_of_encoding)
            peptides_integer_encoded_prod.append(pie)
                       
        # add padding so that all have the same length
        maximum_length = max(len(p) for p in peptides_integer_encoded_calib)
        assert(vocab_size > maximum_length)
        
        xcalib_cnn = pad_sequences(peptides_integer_encoded_calib, 
                                   maxlen=maximum_length, padding='post')
        xprod_cnn = pad_sequences(peptides_integer_encoded_prod, 
                                  maxlen=maximum_length, padding='post')
    else:
        raise NameError
                
    # set up the cnn
    metric = 'acc'
    model_file = model_name + '.h5'
    callbacks_list = [
            EarlyStopping(monitor='val_loss', patience=stop_epochs),
            ModelCheckpoint(filepath=model_file, monitor='val_loss', save_best_only=True)]
    num_output_nodes = num_classes
    layer_1_units = 32
    
    list_cv_metrics = []
    list_cv_losses = []
    list_cv_epochs = []
    cv_num = int(xcalib_cnn.shape[0]/folds)
    for fold in range(folds):
        # create the cnn
        model_cnn = Sequential()
        model_cnn.add(Embedding(vocab_size, embedding_dimension, 
                               input_length=maximum_length))
        model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, 
                             activation='relu'))
        model_cnn.add(MaxPooling1D(pool_size=pooling_size))
        model_cnn.add(Flatten())
        model_cnn.add(Dense(num_output_nodes, activation='softmax'))
        # compile the model
        model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', 
                          metrics=[metric])
        
        # get train/valid
        x_train = np.vstack((xcalib_cnn[:cv_num*fold], xcalib_cnn[cv_num*(fold+1):]))
        y_train = np.vstack((ycalib[:cv_num*fold], ycalib[cv_num*(fold+1):]))
        
        x_valid = xcalib_cnn[cv_num*fold:cv_num*(fold+1)]
        y_valid = ycalib[cv_num*fold:cv_num*(fold+1)]
        
        # fit the model
        h = model_cnn.fit(x_train, y_train, 
                          epochs=num_epochs, 
                          batch_size=batchsize,
                          validation_data=(x_valid, y_valid), 
                          callbacks=callbacks_list,
                          verbose=verbosity)
        
        # collect cv stats
        cv_loss, cv_metric = model_cnn.evaluate(x_valid, y_valid, verbose=0)
        list_cv_metrics.append(cv_metric)
        list_cv_losses.append(cv_loss)
        list_cv_epochs.append(len(h.history['val_loss']))
           
    # redo the model with the maximum epochs from the cv and all calib data, no early stopping
    model_cnn_final = Sequential()
    model_cnn_final.add(Embedding(vocab_size, embedding_dimension, 
                           input_length=maximum_length))
    model_cnn_final.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, 
                         activation='relu'))
    model_cnn_final.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn_final.add(Flatten())
    model_cnn_final.add(Dense(num_output_nodes, activation='softmax'))
    # compile the model
    model_cnn_final.compile(optimizer='adam', loss='categorical_crossentropy', 
                            metrics=[metric])
    # summarize the model
    print('\n** final results for',model_name)
    # fit the model
    h = model_cnn_final.fit(xcalib_cnn, ycalib, 
                      epochs=np.max(list_cv_epochs), 
                      batch_size=batchsize,
                      verbose=verbosity)
        
    y_prod = np.argmax(yprod, axis=1)
    y_prod_predict = np.argmax(model_cnn_final.predict(xprod_cnn), axis=1)
        
    # compute and print error metrics to model_name.txt
    stdout_default = sys.stdout
    sys.stdout = open(results_dir + model_name + '.txt','w')
    print(model_cnn_final.summary(),'\n***\n\n')
    print('\ncross validation metrics:')
    for i in range(folds):
        print('fold:',i+1,'\terror:',list_cv_metrics[i],'\tloss:',list_cv_losses[i],
              '\tepochs:',list_cv_epochs[i])
        
    print('\nmax cross validation metric =',np.max(list_cv_metrics))
    print('min cross validation metric =',np.min(list_cv_metrics))
    print('mean cross validation error =',np.mean(list_cv_metrics))
    
    # production metrics
    balanced_accuracy = balanced_accuracy_score(y_prod, y_prod_predict)
    accuracy = accuracy_score(y_prod, y_prod_predict, normalize=True)  # fractions
    print('\n\nproduction metrics: balanced accuracy =', balanced_accuracy)
    print('production metrics: accuracy =',accuracy)
    print('production classification report:\n',classification_report(y_prod, y_prod_predict))
    sys.stdout = stdout_default
    
    
    # compute confusion matrices for production predictions
    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir,
                                  False, False, 30)  # raw

    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir,
                                  False, True, 30)  # normalized
    
# end of cnn_one_layer_check_early_stop_cv()

# *******************************************************************

if __name__ == '__main__':
    
    code_dir = YOUR PATH
    data_dir = code_dir + 'data/'
    
    df_train = pd.read_pickle(data_dir + 'df_train.pkl')
    df_test = pd.read_pickle(data_dir + 'df_test.pkl')
    df_production = pd.read_pickle(data_dir + 'df_production.pkl')
    
    number_of_classes = df_train['label_num'].value_counts().shape[0]
    sequence_type = 'amino_acid_20'
    encoding_type = 'embedding'
        
    dimension_of_embedding = 32
    
    epochs = 150
    batch_size = 128
    verbose = 0
    
    x_train = df_train['peptide'].values
    y_train = to_categorical(df_train['label_num'].values)
    
    x_test = df_test['peptide'].values
    y_test = to_categorical(df_test['label_num'].values)
    
    x_production = df_production['peptide'].values
    y_production = to_categorical(df_production['label_num'].values)
    
    df_calib = pd.concat([df_train, df_test])
    x_calib = df_calib['peptide'].values
    y_calib = to_categorical(df_calib['label_num'].values)
    
    class_labels = ['Non Binder','Weak Binder','Strong Binder']
    
    # cnn - single layer, no regularization, no cross valid, no early stopping
    pool_size = 3
    for convolution_window_size in [5,7]:
        results_directory = code_dir + 'cnn_1/conv_window_size_' + str(convolution_window_size) + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        cnn_one_layer(x_train, y_train, x_test, y_test, number_of_classes,
                      sequence_type, encoding_type, dimension_of_embedding,
                      epochs, batch_size, convolution_window_size, pool_size,
                      verbose, results_directory, class_labels,
                      x_production, y_production)
    
    # cnn - 1 layer, early stopping, checkpoint, cross valid, no regularization
    cv_folds = 5
    checkpoint_epochs_stop = 10
    pool_size = 3
    cnn_1_directory = code_dir + 'cnn_1_cv/'
    if not Path(cnn_1_directory).is_dir():
        os.mkdir(cnn_1_directory)
    for convolution_window_size in [5,7]:
        results_directory = cnn_1_directory + 'conv_window_size_' + str(convolution_window_size) + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
                        
        cnn_one_layer_check_early_stop_cv(x_calib, y_calib, number_of_classes,
                      sequence_type, encoding_type, dimension_of_embedding,
                      epochs, batch_size, convolution_window_size, pool_size,
                      verbose, results_directory, class_labels,
                      x_production, y_production, checkpoint_epochs_stop, cv_folds)

***************************************************
Loss and accuracy plotting code:

import matplotlib.pyplot as plt

def plot_train_validation(train_data, train_label,  
                          valid_data, valid_label,
                          start_num,
                          xlabel, ylabel,
                          title, directory, show_plot):
    
    plt.ioff()
    fig, ax = plt.subplots()
    
    plt.plot(train_data[start_num:], 'bo', label=train_label)
    plt.plot(valid_data[start_num:], 'ro', label=valid_label)
    ax.set_xlim(start_num, )
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel(ylabel, fontsize=15)
    plt.legend()
    plt.title(title, fontsize=15)
    plt.grid()
    
    plt.savefig(directory + title + '.png')
    plt.clf()
    plt.close('all')

***************************************************
Confusion matrix plotting code:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix

def cm_plot(y_true, y_pred, class_names, plot_dir,
            show_plot=False, normalize=False, xangle=30):
    
    # set color map and precision (for normalized cm)
    cmap=plt.cm.Blues
    np.set_printoptions(precision=3)
    
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        title = 'Normalized Confusion Matrix'
    else:
        title = 'Confusion Matrix'

    plt.ioff()
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           xticklabels=class_names, yticklabels=class_names,
           title=title,
           ylabel='True Label',
           xlabel='Predicted Label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=xangle, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
        
    #save plot
    plt.savefig(plot_dir + title + '.png')
    plt.clf()
    plt.close(fig)

Peptide Binding – Part 1: 1D Convolutional Neural Network

1. Introduction

2. Data

3. Simple 1D CNN

4. Simple 1D CNN With Cross Validation And Early Stopping

5. Code

Affiliate Disclaimer