Hyperparameter Search With GPyOpt: Part 3 – Keras (CNN) Classification and Ensembling

  1. GPyOpt Python Package
  2. Using GPyOpt
  3. Best Result and Ensembling
  4. Results

1. GPyOpt Python Package

GPyOpt is a Python open-source library for Bayesian Optimization developed by the Machine Learning group of the University of Sheffield. It is based on GPy, a Python framework for Gaussian process modelling.

In this article, we demonstrate how to use this package to perform hyperparameter search for a classification problem with Keras.

2. Using GPyOpt

Below are code fragments showing how to integrate the package with Keras.

We begin by specifying whether the problem is one of minimization or maximization. Here, we are again using the MNIST handwritten digits data set for our classification problem, so we wish to minimize the validation categorical crossentropy loss, which we indicate by setting maximize_bool = False.
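
The fragments that follow refer to a handful of module-level names (results_dir, maximum_epochs, the bookkeeping lists, and so on). Below is a minimal sketch of plausible settings, assuming tensorflow.keras; the specific values are illustrative, not the ones used for the results in Section 4:

import os
import numpy as np
import pandas as pd
import GPyOpt

from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import (Input, BatchNormalization, Activation,
                                     Conv2D, MaxPooling2D, Dropout, Flatten, Dense)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import (EarlyStopping, ReduceLROnPlateau,
                                        ModelCheckpoint)

maximize_bool = False        # minimize validation categorical crossentropy
num_random_points = 8        # random initial designs before the GP takes over
num_iterations = 50          # Bayesian optimization iterations
maximum_epochs = 100
early_stop_epochs = 10
learning_rate_epochs = 5
results_dir = './results/'
ml_algo_name = 'cnn'

# per-trial bookkeeping appended to inside cnn_function()
list_early_stop_epochs = []
list_validation_loss = []
list_saved_model_name = []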

Then we set the search space as a list of dictionaries, one per hyperparameter, in which we label the variable names, set variable types to continuous or discrete, and specify their range of values.

search_space = [
        {'name':'num_cnn_blocks', 'type':'discrete', 'domain':(2,3,4)},
        {'name':'num_filters', 'type':'discrete', 'domain':(16,32,48,64)},
        {'name':'kernel_size', 'type':'discrete', 'domain':(2,3,4)},
        {'name':'num_dense_nodes', 'type':'discrete', 
         'domain':(64,96,128,160,192,224,256)},
        {'name':'dense_nodes_divisor', 'type':'discrete', 'domain':(2,4,8)},
        {'name':'batch_size', 'type':'discrete', 'domain':(32,64,96,128)},
        {'name':'drop_out', 'type':'continuous', 'domain':(0.05, 0.5)}]

Now we create our CNN:


def cnn_function(params):
    # unique file name for the saved model: random digits as a suffix
    model_name = ml_algo_name + '_' + str(np.random.uniform(0, 1))[2:9]
    
    dict_params = {'num_cnn_blocks':int(params[0][0]),
                   'num_filters':int(params[0][1]),
                   'kernel_size':int(params[0][2]),
                   'num_dense_nodes':int(params[0][3]),
                   'dense_nodes_divisor':int(params[0][4]),
                   'batch_size':int(params[0][5]),
                   'drop_out':params[0][6]}
      
    input_tensor = Input(shape=input_shape)
    
    x = BatchNormalization()(input_tensor)
    x = Activation('relu')(x)
    x = Conv2D(filters=dict_params['num_filters'],
               kernel_size=dict_params['kernel_size'],
               strides=1, padding='same')(x)
    x = Dropout(dict_params['drop_out'])(x)
    
    for iblock in range(dict_params['num_cnn_blocks'] - 1):
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x = Conv2D(filters=dict_params['num_filters'],
                   kernel_size=dict_params['kernel_size'],
                   strides=1, padding='same')(x)
        x = MaxPooling2D()(x)
        x = Dropout(dict_params['drop_out'])(x)
         
    x = Flatten()(x)
    x = Dense(dict_params['num_dense_nodes'], activation='relu')(x)
    x = Dropout(dict_params['drop_out'])(x)
    x = Dense(dict_params['num_dense_nodes']//dict_params['dense_nodes_divisor'], 
              activation='relu')(x)
    output_tensor = Dense(number_of_classes, activation='softmax')(x)
    
    cnn_model = Model(inputs=input_tensor, outputs=output_tensor)
    opt = Adam(lr=0.00025)
    cnn_model.compile(loss='categorical_crossentropy',
                      optimizer=opt, metrics=['accuracy'])
    
    
    callbacks_list = [EarlyStopping(monitor='val_loss', patience=early_stop_epochs),                     
                      ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
                                        patience=learning_rate_epochs, 
                                        verbose=0, mode='auto', min_lr=1.0e-6),
                      ModelCheckpoint(filepath=results_dir + model_name + '.h5',
                                      monitor='val_loss', save_best_only=True)]
  
    h = cnn_model.fit(x=xcalib, y=ycalib,
                      batch_size=dict_params['batch_size'],
                      epochs=maximum_epochs,
                      validation_split=0.25,
                      shuffle=True, verbose=0,
                      callbacks=callbacks_list)
       
    # record the actual number of training epochs and the validation loss;
    # these are added to the Bayesian optimization results DataFrame below
    list_early_stop_epochs.append(len(h.history['val_loss']) - early_stop_epochs)
    
    validation_loss = np.min(h.history['val_loss'])
    list_validation_loss.append(validation_loss)
    list_saved_model_name.append(model_name)
    
    return validation_loss

params holds the current point in the search space as a 2D array, with one column per variable. Note that the column order must match the order in which the variables were specified in search_space.
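
To make the indexing concrete, here is a hypothetical evaluation point as GPyOpt would supply it to cnn_function:

import numpy as np

# one row, one column per variable, in the order declared in search_space:
# num_cnn_blocks, num_filters, kernel_size, num_dense_nodes,
# dense_nodes_divisor, batch_size, drop_out
params = np.array([[3., 32., 3., 128., 4., 64., 0.2]])

num_filters = int(params[0][1])   # 32
drop_out = params[0][6]           # 0.2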

We create our Bayesian optimization model:

gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=cnn_function, domain=search_space, 
                model_type='GP', initial_design_numdata=num_random_points, 
                initial_design_type='random', acquisition_type='EI', 
                normalize_Y=True, exact_feval=False, 
                acquisition_optimizer_type='lbfgs', 
                model_update_interval=1, evaluator_type='sequential', 
                batch_size=1, num_cores=os.cpu_count(), verbosity=True, 
                verbosity_model=False, maximize=maximize_bool, de_duplication=True)

We are using Gaussian processes with the expected improvement (EI) acquisition function; the acquisition function itself is optimized with the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (lbfgs). See https://gpyopt.readthedocs.io/en/latest/GPyOpt.methods.html for descriptions of all of the parameters, as well as how to use surrogate models other than Gaussian processes.
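
GPyOpt computes expected improvement internally; for reference only, here is the textbook form of the criterion for a minimization problem, written out in NumPy/SciPy:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # mu, sigma: GP posterior mean and standard deviation at a candidate point
    # f_best: lowest objective value observed so far (minimization)
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)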

In the next step, we set up log files and run the optimization by specifying the number of trials.

# text file with parameter info, gp info, best results
rf = results_dir + 'report'

# text file with header: Iteration Y var_1 var_2 etc.
ef = results_dir + 'evaluation'

# text file with gp model info
mf = results_dir + 'models'
gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf, 
                           evaluations_file=ef, models_file=mf)

Finally, we create a Pandas DataFrame to store the hyperparameter values, validation loss, model name, etc. for each trial. Note that GPyOpt renders integer values as reals (1 -> 1.0).

header_params = []
for param in search_space:
    header_params.append(param['name'])
    
df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

# gpyopt_bo.Y holds the objective value recorded at each trial;
# note that integer variables are rendered as real (1 -> 1.0)
df_results['validation_loss'] = gpyopt_bo.Y
df_results['epochs'] = list_early_stop_epochs
df_results['validation loss check'] = list_validation_loss
df_results['model_name'] = list_saved_model_name
df_results.to_pickle(results_dir + 'df_results.pkl')
df_results.to_csv(results_dir + 'df_results.csv')
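
If the float rendering is a nuisance, the discrete columns can be cast back to integers before the DataFrame is saved; a small sketch:

# cast discrete variables back to integers (GPyOpt stores every variable as a float)
integer_columns = ['num_cnn_blocks', 'num_filters', 'kernel_size',
                   'num_dense_nodes', 'dense_nodes_divisor', 'batch_size']
df_results[integer_columns] = df_results[integer_columns].astype(int)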

3. Best Result and Ensembling

To obtain an ensemble, we load the saved models, sort by 'validation_loss', and apply a threshold so that only the top 10 models are used. Then we input the production data (unseen data) to obtain class probabilities, NOT the predicted classes themselves. These probabilities are averaged across models, and the final predicted class is the one with the highest mean probability (soft voting).

To obtain the single best result, we sort the DataFrame from the previous section and use the hyperparameters that yielded the lowest validation loss.

# using validation loss, thus minimization; threshold and accepted_models_num
# are set elsewhere (e.g., threshold = lowest validation loss, accepted_models_num = 10),
# and df_params is the results DataFrame from the previous section
threshold_multiplier = 1.0
df_params.sort_values(by='validation_loss', ascending=True, inplace=True)

# apply threshold
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    bool1 = df_params.loc[df_params.index[i],'validation_loss'] < threshold_multiplier*threshold
    bool2 = accepted_models < accepted_models_num
    if bool1 and bool2:
        
        model_name = df_params.loc[df_params.index[i],'model_name']
        model_final = load_model(models_directory + model_name + '.h5')
        
        # these are class probabilities
        list_predicted_prob.append(model_final.predict(xprod))
        
        # best single model: first row after sorting by validation loss
        if i == 0:
            y_predicted_class_best = np.argmax(model_final.predict(xprod), axis=1)
            
            name_result = ml_name + '_best'
            classification_analysis.classification_analysis(
            yprod, y_predicted_class_best, list_class_names,
            save_directory, name_result)
        
        accepted_models = accepted_models + 1

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
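
The ensemble predictions can then be scored with the same helper used above for the best single model; a sketch assuming the same classification_analysis signature:

# score the ensemble the same way as the best single model
name_result = ml_name + '_ensemble'
classification_analysis.classification_analysis(
    yprod, y_predicted_class_ensemble, list_class_names,
    save_directory, name_result)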

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data.
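
The data preparation is not shown in the fragments above. Below is a minimal sketch of one way to set up the variables they rely on (xcalib, ycalib, xprod, yprod, input_shape, number_of_classes), assuming tensorflow.keras; the preprocessing details are assumptions:

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(xtrain, ytrain), (xprod, yprod) = mnist.load_data()

# scale to [0, 1] and add a channel dimension
xtrain = xtrain.astype('float32').reshape(-1, 28, 28, 1) / 255.0
xprod = xprod.astype('float32').reshape(-1, 28, 28, 1) / 255.0

number_of_classes = 10
input_shape = (28, 28, 1)

# use half of the training data as the calibration set
half = xtrain.shape[0] // 2
xcalib = xtrain[:half]
ycalib = to_categorical(ytrain[:half], number_of_classes)
# yprod stays as integer labels for scoring the predicted classes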

Top 10 trials, sorted by validation loss:

num_cnn_blocks  num_filters  kernel_size  num_dense_nodes  dense_nodes_divisor  batch_size  drop_out  validation_loss
4               64           4            192              2                    64          0.0500    0.0293
4               64           4            192              8                    128         0.0500    0.0299
4               64           4            160              8                    128         0.0500    0.0328
4               32           4            192              8                    128         0.0500    0.0342
4               16           4            256              2                    96          0.0500    0.0344
4               32           4            256              4                    128         0.0500    0.0348
4               64           4            128              8                    96          0.0500    0.0348
3               64           4            192              2                    32          0.4195    0.0349
4               64           4            160              8                    96          0.1196    0.0354
4               48           4            256              8                    128         0.0500    0.0359

Best results:
balanced accuracy score = 0.9941
accuracy score = 0.9942

Ensemble results:
balanced accuracy score = 0.9945
accuracy score = 0.9946