1. Introduction
In Deep Learning for Cancer Immunotherapy, Leon Eyrich Jessen used various machine learning algorithms to classify peptide binding strength into three classes: ‘strong binder’ (SB), ‘weak binder’ (WB), or ‘non-binder’ (NB). The author used 9-mer peptides (sequences of 9 amino acids) as raw input data. This data was created and transformed as follows:
The input data for this use case was created by generating 1,000,000 random 9-mer peptides by sampling the one-letter code for the 20 amino acids, i.e. ARNDCQEGHILKMFPSTWYV, and then submitting the peptides to MHCI binding prediction using the current state-of-the-art model netMHCpan. Different variants of MHCI exist, so for this case we chose HLA-A*02:01.
… each peptide will be encoded by assigning a vector of 20 values, where each value is the probability of the amino acid mutating into 1 of the 20 others as defined by the BLOSUM62 matrix using the pep_encode() function from the PepTools package. This way each peptide is converted to an ‘image’ matrix with 9 rows and 20 columns.
Note that the labels themselves come from a prediction model (netMHCpan), so this is modeling a model rather than experimental measurements.
After converting the data in this way, it was scaled to values between 0 and 1. Then multilayer perceptron (MLP), 2D convolutional neural network, and random forest algorithms were used (the 2D data was flattened for the MLP and random forest).
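To make the encoding described above concrete, here is a minimal sketch of a BLOSUM62-based ‘image’ encoding. This is not the pep_encode() function from PepTools; it assumes Biopython is installed to supply the BLOSUM62 matrix, and it uses a simple per-peptide min-max scaling to [0, 1], which matches the spirit but not necessarily the details of the original transformation.

```python
import numpy as np
from Bio.Align import substitution_matrices  # assumes Biopython is installed

AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV'
BLOSUM62 = substitution_matrices.load('BLOSUM62')

def blosum_encode(peptide):
    """Encode a 9-mer peptide as a 9 x 20 matrix of scaled BLOSUM62 scores."""
    image = np.array([[BLOSUM62[aa, other] for other in AMINO_ACIDS]
                      for aa in peptide], dtype=float)
    # min-max scale to [0, 1] (done per peptide here for illustration only;
    # the original work scaled the full data set)
    return (image - image.min()) / (image.max() - image.min())

print(blosum_encode('LLTDAQRIV').shape)  # (9, 20)
```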
See the article for results and discussion about the biology behind the problem.
Our article is the first in a series, where we will explore this problem via a variety of methods. We begin with a 1D convolutional neural network.
2. Data
The original raw data can be found here. In case this link dies, we provide a zipped csv file.
Below is a snippet of the data:
peptide | label_chr | label_num | data_type |
---|---|---|---|
LLTDAQRIV | WB | 1 | train |
LMAFYLYEV | SB | 2 | train |
VMSPITLPT | WB | 1 | test |
SLHLTNCFV | WB | 1 | train |
RQFTCMIAV | WB | 1 | train |
The following code deletes the label_chr column and divides the data into calibration data (which is further divided into train and test sets) and production data, with the latter serving as new, unseen data.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

def create_data():
    # get raw data
    dfraw = pd.read_csv(data_dir + 'raw_data.csv')
    dfraw.drop(columns=['label_chr'], inplace=True)

    # split into calibration (train) and production (test)
    type_group = dfraw.groupby('data_type')
    for group_name, dfgroup in type_group:
        dfgroup.drop(columns=['data_type'], inplace=True)

        if group_name == 'train':
            dfgroup = shuffle(dfgroup)
            dfgroup.to_pickle(data_dir + 'df_calibration.pkl')  # header = ['peptide', 'label_num']
            print('calibration data size:', dfgroup.shape)

            # split calib into train/test (80/20)
            df_train, df_test = train_test_split(dfgroup, train_size=0.8, shuffle=True)

            df_train.to_pickle(data_dir + 'df_train.pkl')
            print('train data size:', df_train.shape)

            df_test.to_pickle(data_dir + 'df_test.pkl')
            print('test data size:', df_test.shape)

        elif group_name == 'test':
            dfgroup.to_pickle(data_dir + 'df_production.pkl')
            print('production data size:', dfgroup.shape)
```
The output is:
- production data size: (2376, 2)
- calibration data size: (21384, 2)
- train data size: (17107, 2)
- test data size: (4277, 2)
Peptides are represented as strings of letters. For this problem, the alphabet is restricted to the 20 letters that are abbreviations for amino acids directly encoded by genes in humans.

[Image: the 20 common amino acids. Source: https://i2.wp.com/www.compoundchem.com/wp-content/uploads/2014/09/20-Common-Amino-Acids-v3.png]
We represent peptides via character level embedding. First, letters are assigned unique integers using the following code:
```python
import numpy as np

def encode_sequence(sequence, sequence_type, encoding_type):
    if sequence_type == 'dna':
        letters = 'ACGT'
    elif sequence_type == 'amino_acid_20':
        letters = 'ACDEFGHIKLMNPQRSTVWY'
    elif sequence_type == 'amino_acid_22':
        letters = 'ACDEFGHIKLMNOPQRSTUVWY'
    else:
        print('\ninvalid sequence_type in encode_sequence()', sequence_type)
        print('in encode_biological_sequence.py')
        raise NameError

    num_letters = len(letters)

    if encoding_type == 'one_hot':
        mapping = dict(zip(letters, range(num_letters)))
        sequence_mapped = [mapping[i] for i in sequence]
        transformed_sequence = np.eye(num_letters)[sequence_mapped]
    elif encoding_type == 'decimal_vector':
        mapping = dict(zip(letters, np.linspace(1/num_letters, 1, num=num_letters)))
        transformed_sequence = [mapping[i] for i in sequence]
    elif encoding_type == 'embedding':
        # reserve 0 for padding
        mapping = dict(zip(letters, range(1, num_letters + 1)))
        transformed_sequence = [mapping[i] for i in sequence]
    else:
        print('\ninvalid encoding_type in encode_sequence()', encoding_type)
        print('in encode_biological_sequence.py')
        raise NameError

    return transformed_sequence
```
For example, this code yields:
LLTDAQRIV = [10, 10, 17, 3, 1, 14, 15, 8, 18]
LMAFYLYEV = [10, 11, 1, 5, 20, 10, 20, 4, 18]
VMSPITLPT = [18, 11, 16, 13, 8, 17, 10, 13, 17]
In the following piece of code, the above function appears as encode_biological_sequence.encode_sequence(), which performs the transformation shown above. Then, using the reserved integer 0, the integer representations are padded (0s are appended to the end) so that all peptides have the same length. Finally, a Keras Embedding layer is used so that the CNN can learn an internal representation of the peptides as part of the overall process of learning to assign peptide binding strengths. In this way, useful features of the representation are learned automatically rather than hand engineered.
```python
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences

import encode_biological_sequence

def cnn_one_layer(xtrainraw, ... ):

    embedding_dimension = 32
    vocab_size = 50  # > number of possible unique words
    maximum_length = 0

    if type_of_encoding == 'embedding':
        # integer encode the peptides
        peptides_integer_encoded_train = []
        for pep in xtrainraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_train.append(pie)

        peptides_integer_encoded_test = []
        for pep in xtestraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_test.append(pie)

        peptides_integer_encoded_prod = []
        for pep in xprodraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_prod.append(pie)

        # add padding so that all have the same length
        max_length_train = max(len(p) for p in peptides_integer_encoded_train)
        max_length_test = max(len(p) for p in peptides_integer_encoded_test)
        maximum_length = max(max_length_train, max_length_test)
        assert(vocab_size > maximum_length)

        xtrain_cnn = pad_sequences(peptides_integer_encoded_train, maxlen=maximum_length, padding='post')
        xtest_cnn = pad_sequences(peptides_integer_encoded_test, maxlen=maximum_length, padding='post')
        xprod_cnn = pad_sequences(peptides_integer_encoded_prod, maxlen=maximum_length, padding='post')
    else:
        raise NameError

    # create the cnn
    model_cnn.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
```
The variables xtrainraw, xtestraw, xprodraw are the peptides represented as strings, like ‘LLTDAQRIV’.
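For reference, these raw peptide strings are pulled from the pickled dataframes created in Section 2. The driver code in Section 5 does this roughly as follows:

```python
import pandas as pd

# data_dir points to the directory where the pickled dataframes were saved
df_train = pd.read_pickle(data_dir + 'df_train.pkl')
x_train = df_train['peptide'].values  # array of strings like 'LLTDAQRIV'
```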
The embedding dimension was determined arbitrarily. It will be optimized in a later article.
The targets are the binding strengths (to the MHCI variant HLA-A*02:01, as described in the introduction), represented by three classes: NB (Non Binder) = 0, WB (Weak Binder) = 1, SB (Strong Binder) = 2. We then use one hot encoding:
```python
from keras.utils.np_utils import to_categorical

y_train = to_categorical(df_train['label_num'].values)
```
so that 0 = [1., 0., 0.], 1 = [0., 1., 0.], and 2 = [0., 0., 1.].
3. Simple 1D CNN
We use a simple 1D CNN for this classification problem. The architecture is simple: there is no regularization and no attempt to optimize any parameters. The goal here is to establish a baseline model; enhancements will be applied in a later article.
```python
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import Conv1D, MaxPooling1D, Flatten

def cnn_one_layer(xtrainraw, ... ):

    num_output_nodes = num_classes
    layer_1_units = 32

    model_cnn = Sequential()
    model_cnn.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
    model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
    model_cnn.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn.add(Flatten())
    model_cnn.add(Dense(num_output_nodes, activation='softmax'))

    # compile the model
    model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

    # summarize the model
    print('\n** results for simple cnn:')
    print(model_cnn.summary())

    # fit the model
    h = model_cnn.fit(xtrain_cnn, ytrain, epochs=num_epochs, batch_size=batchsize,
                      validation_data=(xtest_cnn, ytest), verbose=verbosity)
```
There are 3 classes, which determines the size of the output layer since we are using one hot encoding of the classes. For the kernel size in the convolutional layer, we tried 5 and 7, with 7 yielding the best result. The pool size was set to 3.
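The layer shapes and parameter counts in the summary below can be verified by hand. Here is a quick sketch of the arithmetic, using the standard Keras parameter-count formulas:

```python
# Embedding: vocab_size * embedding_dimension
print(50 * 32)            # 1600

# Conv1D: (kernel_size * input_channels + 1) * filters
# output length = 9 - 7 + 1 = 3
print((7 * 32 + 1) * 32)  # 7200

# MaxPooling1D(pool_size=3) on length 3 -> length 1; Flatten -> 32 values

# Dense: (inputs + 1) * outputs
print((32 + 1) * 3)       # 99
```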
Here is a view of the CNN:
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 9, 32)             1600      
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 3, 32)             7200      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 1, 32)             0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 99        
=================================================================
Total params: 8,899
Trainable params: 8,899
Non-trainable params: 0
_________________________________________________________________
```
Here are plots of the loss and accuracy (note that the classes are well balanced):
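The observation that the classes are well balanced can be verified directly from the data. A minimal check (the column name comes from the data snippet in Section 2):

```python
import pandas as pd

# data_dir points to the directory created in Section 2
df_calib = pd.read_pickle(data_dir + 'df_calibration.pkl')
print(df_calib['label_num'].value_counts())  # counts per class: 0 = NB, 1 = WB, 2 = SB
```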
4. Simple 1D CNN With Cross Validation And Early Stopping
We use the same CNN as above and implement 5-fold cross validation. Additionally, based on the loss and accuracy curves, we add early stopping, so that training stops if the validation loss does not decrease for 10 epochs. Since we are not optimizing parameters, we then retrain the model on all of the calibration data, using the maximum number of epochs observed across the folds, and use it to predict the classification of the production (unseen) data.
```python
metric = 'acc'

model_file = model_name + '.h5'
callbacks_list = [
    EarlyStopping(monitor='val_loss', patience=stop_epochs),
    ModelCheckpoint(filepath=model_file, monitor='val_loss', save_best_only=True)]

num_output_nodes = num_classes
layer_1_units = 32

list_cv_metrics = []
list_cv_losses = []
list_cv_epochs = []

cv_num = int(xcalib_cnn.shape[0]/folds)
for fold in range(folds):
    # create the cnn
    model_cnn = Sequential()
    model_cnn.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
    model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
    model_cnn.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn.add(Flatten())
    model_cnn.add(Dense(num_output_nodes, activation='softmax'))

    # compile the model
    model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[metric])

    # get train/valid
    x_train = np.vstack((xcalib_cnn[:cv_num*fold], xcalib_cnn[cv_num*(fold+1):]))
    y_train = np.vstack((ycalib[:cv_num*fold], ycalib[cv_num*(fold+1):]))

    x_valid = xcalib_cnn[cv_num*fold:cv_num*(fold+1)]
    y_valid = ycalib[cv_num*fold:cv_num*(fold+1)]

    # fit the model
    h = model_cnn.fit(x_train, y_train, epochs=num_epochs, batch_size=batchsize,
                      validation_data=(x_valid, y_valid),
                      callbacks=callbacks_list, verbose=verbosity)

    # collect cv stats
    cv_loss, cv_metric = model_cnn.evaluate(x_valid, y_valid, verbose=0)
    list_cv_metrics.append(cv_metric)
    list_cv_losses.append(cv_loss)
    list_cv_epochs.append(len(h.history['val_loss']))
```
```python
# redo the model with the maximum epochs from the cv and all calib data, no early stopping
model_cnn_final = Sequential()
model_cnn_final.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
model_cnn_final.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
model_cnn_final.add(MaxPooling1D(pool_size=pooling_size))
model_cnn_final.add(Flatten())
model_cnn_final.add(Dense(num_output_nodes, activation='softmax'))

# compile the model
model_cnn_final.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[metric])

# summarize the model
print('\n** final results for', model_name)

# fit the model
h = model_cnn_final.fit(xcalib_cnn, ycalib, epochs=np.max(list_cv_epochs),
                        batch_size=batchsize, verbose=verbosity)

y_prod = np.argmax(yprod, axis=1)
y_prod_predict = np.argmax(model_cnn_final.predict(xprod_cnn), axis=1)
```
Note that by using one hot encoding of the classes and softmax in the output layer, the predictions are class probabilities. These probabilities are converted to a definite class by taking the most probable class (np.argmax). This is done in the last two lines of the code above.
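For example, a single softmax output row might look like this (the numbers below are made up purely for illustration):

```python
import numpy as np

probabilities = np.array([[0.05, 0.15, 0.80]])  # hypothetical softmax output for one peptide
print(np.argmax(probabilities, axis=1))         # [2] -> strong binder (SB)
```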
Here are the results of the final 1d CNN:
Cross validation metrics:
Fold | Accuracy | Loss | Epochs |
---|---|---|---|
1 | 0.90388 | 0.24380 | 49 |
2 | 0.90341 | 0.23927 | 45 |
3 | 0.90598 | 0.23742 | 42 |
4 | 0.90598 | 0.22416 | 50 |
5 | 0.90739 | 0.21724 | 45 |
max cross validation metric = 0.90739
min cross validation metric = 0.90341
mean cross validation error = 0.90533
production metrics: balanced accuracy = 0.91645
production metrics: accuracy = 0.91624
Production classification report:
Class | Precision | Recall | F1 Score | Support |
---|---|---|---|---|
0 | 0.98 | 0.97 | 0.97 | 782 |
1 | 0.89 | 0.86 | 0.87 | 792 |
2 | 0.88 | 0.92 | 0.90 | 802 |
Accuracy | | | 0.92 | 2376 |
Macro Average | 0.92 | 0.92 | 0.92 | 2376 |
Weighted Average | 0.92 | 0.92 | 0.92 | 2376 |


Our final result of 91.62% compares favorably with the results reported in the article: 94.8% for the MLP, 92% for the 2D CNN, and 82.2% for the random forest. Recall that the paper used additional biological information (the BLOSUM62 encoding) to enhance the data, but, like us, made no attempt at optimization or regularization. Unfortunately, the article only shows confusion matrices graphically, without numbers, so we cannot compare them with our results.
5. Code
Below are listings of the code used in this article.
1D CNN code:
```python
import os
import pandas as pd
import numpy as np
import sys
from pathlib import Path

from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import SimpleRNN, GRU, LSTM, Bidirectional
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint, EarlyStopping

from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

sys.path.append(YOUR PATH)
import confusion_matrix_plot
import train_validation_plots

sys.path.append(YOUR PATH)
import encode_biological_sequence

# *******************************************************************

def cnn_one_layer(xtrainraw, ytrain, xtestraw, ytest, num_classes,
                  type_of_sequence, type_of_encoding, embedding_dimension,
                  num_epochs, batchsize, conv_window_size, pooling_size,
                  verbosity, results_dir, class_names, xprodraw, yprod):

    model_name = 'CNN_Conv_Window_Size_' + str(conv_window_size) + '_' + type_of_encoding.title()

    vocab_size = 50  # > number of possible unique words
    maximum_length = 0

    if type_of_encoding == 'embedding':
        # integer encode the peptides
        peptides_integer_encoded_train = []
        for pep in xtrainraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_train.append(pie)

        peptides_integer_encoded_test = []
        for pep in xtestraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_test.append(pie)

        peptides_integer_encoded_prod = []
        for pep in xprodraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_prod.append(pie)

        # add padding so that all have the same length
        max_length_train = max(len(p) for p in peptides_integer_encoded_train)
        max_length_test = max(len(p) for p in peptides_integer_encoded_test)
        maximum_length = max(max_length_train, max_length_test)
        assert(vocab_size > maximum_length)

        xtrain_cnn = pad_sequences(peptides_integer_encoded_train, maxlen=maximum_length, padding='post')
        xtest_cnn = pad_sequences(peptides_integer_encoded_test, maxlen=maximum_length, padding='post')
        xprod_cnn = pad_sequences(peptides_integer_encoded_prod, maxlen=maximum_length, padding='post')
    else:
        raise NameError

    # create the cnn
    num_output_nodes = num_classes
    layer_1_units = 32

    model_cnn = Sequential()
    model_cnn.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
    model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
    model_cnn.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn.add(Flatten())
    model_cnn.add(Dense(num_output_nodes, activation='softmax'))

    # compile the model
    model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

    # summarize the model
    print('\n** results for simple cnn:')
    print(model_cnn.summary())

    # fit the model
    h = model_cnn.fit(xtrain_cnn, ytrain, epochs=num_epochs, batch_size=batchsize,
                      validation_data=(xtest_cnn, ytest), verbose=verbosity)

    start_epoch = 10

    # plot loss
    plot_title = model_name.replace('_', ' ') + ' Loss'
    train_validation_plots.plot_train_validation(h.history['loss'], 'Training Loss',
                                                 h.history['val_loss'], 'Validation Loss',
                                                 start_epoch, 'Epochs', 'Loss', plot_title,
                                                 results_dir, False)

    # plot accuracy
    plot_title = model_name.replace('_', ' ') + ' Accuracy'
    train_validation_plots.plot_train_validation(h.history['acc'], 'Training Accuracy',
                                                 h.history['val_acc'], 'Validation Accuracy',
                                                 start_epoch, 'Epochs', 'Accuracy', plot_title,
                                                 results_dir, False)

    y_test = np.argmax(ytest, axis=1)
    y_test_predict = np.argmax(model_cnn.predict(xtest_cnn), axis=1)

    y_prod = np.argmax(yprod, axis=1)
    y_prod_predict = np.argmax(model_cnn.predict(xprod_cnn), axis=1)

    # compute and print error metrics to model_name.txt
    stdout_default = sys.stdout
    sys.stdout = open(results_dir + model_name + '.txt', 'w')

    print(model_cnn.summary(), '\n***\n\n')

    # test metrics
    balanced_accuracy = balanced_accuracy_score(y_test, y_test_predict)
    accuracy = accuracy_score(y_test, y_test_predict, normalize=True)  # fractions

    print('test metrics: balanced accuracy =', balanced_accuracy)
    print('test metrics: accuracy =', accuracy)
    print('test classification report:\n', classification_report(y_test, y_test_predict))

    # production metrics
    balanced_accuracy = balanced_accuracy_score(y_prod, y_prod_predict)
    accuracy = accuracy_score(y_prod, y_prod_predict, normalize=True)  # fractions

    print('\n\nproduction metrics: balanced accuracy =', balanced_accuracy)
    print('production metrics: accuracy =', accuracy)
    print('production classification report:\n', classification_report(y_prod, y_prod_predict))

    sys.stdout = stdout_default

    # compute confusion matrices for production predictions
    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir, False, False, 30)  # raw
    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir, False, True, 30)   # normalized

# end of cnn_one_layer()

# *******************************************************************

def cnn_one_layer_check_early_stop_cv(xcalibraw, ycalib, num_classes,
                                      type_of_sequence, type_of_encoding, embedding_dimension,
                                      num_epochs, batchsize, conv_window_size, pooling_size,
                                      verbosity, results_dir, class_names, xprodraw, yprod,
                                      stop_epochs, folds):

    model_name = 'CNN_1_Conv_Window_Size_' + str(conv_window_size) + '_' + type_of_encoding.title()

    vocab_size = 50  # > number of possible unique words
    maximum_length = 0

    if type_of_encoding == 'embedding':
        # integer encode the peptides
        peptides_integer_encoded_calib = []
        for pep in xcalibraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_calib.append(pie)

        peptides_integer_encoded_prod = []
        for pep in xprodraw:
            pie = encode_biological_sequence.encode_sequence(pep, type_of_sequence, type_of_encoding)
            peptides_integer_encoded_prod.append(pie)

        # add padding so that all have the same length
        maximum_length = max(len(p) for p in peptides_integer_encoded_calib)
        assert(vocab_size > maximum_length)

        xcalib_cnn = pad_sequences(peptides_integer_encoded_calib, maxlen=maximum_length, padding='post')
        xprod_cnn = pad_sequences(peptides_integer_encoded_prod, maxlen=maximum_length, padding='post')
    else:
        raise NameError

    # set up the cnn
    metric = 'acc'

    model_file = model_name + '.h5'
    callbacks_list = [
        EarlyStopping(monitor='val_loss', patience=stop_epochs),
        ModelCheckpoint(filepath=model_file, monitor='val_loss', save_best_only=True)]

    num_output_nodes = num_classes
    layer_1_units = 32

    list_cv_metrics = []
    list_cv_losses = []
    list_cv_epochs = []

    cv_num = int(xcalib_cnn.shape[0]/folds)
    for fold in range(folds):
        # create the cnn
        model_cnn = Sequential()
        model_cnn.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
        model_cnn.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
        model_cnn.add(MaxPooling1D(pool_size=pooling_size))
        model_cnn.add(Flatten())
        model_cnn.add(Dense(num_output_nodes, activation='softmax'))

        # compile the model
        model_cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[metric])

        # get train/valid
        x_train = np.vstack((xcalib_cnn[:cv_num*fold], xcalib_cnn[cv_num*(fold+1):]))
        y_train = np.vstack((ycalib[:cv_num*fold], ycalib[cv_num*(fold+1):]))

        x_valid = xcalib_cnn[cv_num*fold:cv_num*(fold+1)]
        y_valid = ycalib[cv_num*fold:cv_num*(fold+1)]

        # fit the model
        h = model_cnn.fit(x_train, y_train, epochs=num_epochs, batch_size=batchsize,
                          validation_data=(x_valid, y_valid),
                          callbacks=callbacks_list, verbose=verbosity)

        # collect cv stats
        cv_loss, cv_metric = model_cnn.evaluate(x_valid, y_valid, verbose=0)
        list_cv_metrics.append(cv_metric)
        list_cv_losses.append(cv_loss)
        list_cv_epochs.append(len(h.history['val_loss']))

    # redo the model with the maximum epochs from the cv and all calib data, no early stopping
    model_cnn_final = Sequential()
    model_cnn_final.add(Embedding(vocab_size, embedding_dimension, input_length=maximum_length))
    model_cnn_final.add(Conv1D(filters=layer_1_units, kernel_size=conv_window_size, activation='relu'))
    model_cnn_final.add(MaxPooling1D(pool_size=pooling_size))
    model_cnn_final.add(Flatten())
    model_cnn_final.add(Dense(num_output_nodes, activation='softmax'))

    # compile the model
    model_cnn_final.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[metric])

    # summarize the model
    print('\n** final results for', model_name)

    # fit the model
    h = model_cnn_final.fit(xcalib_cnn, ycalib, epochs=np.max(list_cv_epochs),
                            batch_size=batchsize, verbose=verbosity)

    y_prod = np.argmax(yprod, axis=1)
    y_prod_predict = np.argmax(model_cnn_final.predict(xprod_cnn), axis=1)

    # compute and print error metrics to model_name.txt
    stdout_default = sys.stdout
    sys.stdout = open(results_dir + model_name + '.txt', 'w')

    print(model_cnn_final.summary(), '\n***\n\n')

    print('\ncross validation metrics:')
    for i in range(folds):
        print('fold:', i+1, '\terror:', list_cv_metrics[i], '\tloss:', list_cv_losses[i],
              '\tepochs:', list_cv_epochs[i])

    print('\nmax cross validation metric =', np.max(list_cv_metrics))
    print('min cross validation metric =', np.min(list_cv_metrics))
    print('mean cross validation error =', np.mean(list_cv_metrics))

    # production metrics
    balanced_accuracy = balanced_accuracy_score(y_prod, y_prod_predict)
    accuracy = accuracy_score(y_prod, y_prod_predict, normalize=True)  # fractions

    print('\n\nproduction metrics: balanced accuracy =', balanced_accuracy)
    print('production metrics: accuracy =', accuracy)
    print('production classification report:\n', classification_report(y_prod, y_prod_predict))

    sys.stdout = stdout_default

    # compute confusion matrices for production predictions
    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir, False, False, 30)  # raw
    confusion_matrix_plot.cm_plot(y_prod, y_prod_predict, class_names, results_dir, False, True, 30)   # normalized

# end of cnn_one_layer_check_early_stop_cv()

# *******************************************************************

if __name__ == '__main__':

    code_dir = YOUR PATH
    data_dir = code_dir + 'data/'

    df_train = pd.read_pickle(data_dir + 'df_train.pkl')
    df_test = pd.read_pickle(data_dir + 'df_test.pkl')
    df_production = pd.read_pickle(data_dir + 'df_production.pkl')

    number_of_classes = df_train['label_num'].value_counts().shape[0]

    sequence_type = 'amino_acid_20'
    encoding_type = 'embedding'
    dimension_of_embedding = 32

    epochs = 150
    batch_size = 128
    verbose = 0

    x_train = df_train['peptide'].values
    y_train = to_categorical(df_train['label_num'].values)

    x_test = df_test['peptide'].values
    y_test = to_categorical(df_test['label_num'].values)

    x_production = df_production['peptide'].values
    y_production = to_categorical(df_production['label_num'].values)

    df_calib = pd.concat([df_train, df_test])
    x_calib = df_calib['peptide'].values
    y_calib = to_categorical(df_calib['label_num'].values)

    class_labels = ['Non Binder', 'Weak Binder', 'Strong Binder']

    # cnn - single layer, no regularization, no cross valid, no early stopping
    pool_size = 3
    for convolution_window_size in [5, 7]:
        results_directory = code_dir + 'cnn_1/conv_window_size_' + str(convolution_window_size) + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        cnn_one_layer(x_train, y_train, x_test, y_test, number_of_classes,
                      sequence_type, encoding_type, dimension_of_embedding,
                      epochs, batch_size, convolution_window_size, pool_size,
                      verbose, results_directory, class_labels,
                      x_production, y_production)

    # cnn - 1 layer, early stopping, checkpoint, cross valid, no regularization
    cv_folds = 5
    checkpoint_epochs_stop = 10
    pool_size = 3

    cnn_1_directory = code_dir + 'cnn_1_cv/'
    if not Path(cnn_1_directory).is_dir():
        os.mkdir(cnn_1_directory)

    for convolution_window_size in [5, 7]:
        results_directory = cnn_1_directory + 'conv_window_size_' + str(convolution_window_size) + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        cnn_one_layer_check_early_stop_cv(x_calib, y_calib, number_of_classes,
                                          sequence_type, encoding_type, dimension_of_embedding,
                                          epochs, batch_size, convolution_window_size, pool_size,
                                          verbose, results_directory, class_labels,
                                          x_production, y_production,
                                          checkpoint_epochs_stop, cv_folds)
```
***************************************************
Loss and accuracy plotting code:
```python
import matplotlib.pyplot as plt

def plot_train_validation(train_data, train_label, valid_data, valid_label,
                          start_num, xlabel, ylabel, title, directory, show_plot):

    plt.ioff()
    fig, ax = plt.subplots()

    plt.plot(train_data[start_num:], 'bo', label=train_label)
    plt.plot(valid_data[start_num:], 'ro', label=valid_label)

    ax.set_xlim(start_num, )

    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel(ylabel, fontsize=15)
    plt.legend()
    plt.title(title, fontsize=15)
    plt.grid()

    plt.savefig(directory + title + '.png')

    plt.clf()
    plt.close('all')
```
***************************************************
Confusion matrix plotting code:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def cm_plot(y_true, y_pred, class_names, plot_dir, show_plot=False, normalize=False, xangle=30):

    # set color map and precision (for normalized cm)
    cmap = plt.cm.Blues
    np.set_printoptions(precision=3)

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        title = 'Normalized Confusion Matrix'
    else:
        title = 'Confusion Matrix'

    plt.ioff()
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)

    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           xticklabels=class_names, yticklabels=class_names,
           title=title,
           ylabel='True Label',
           xlabel='Predicted Label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=xangle, ha="right", rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")

    fig.tight_layout()

    # save plot
    plt.savefig(plot_dir + title + '.png')

    plt.clf()
    plt.close(fig)
```