1. GPyOpt Python Package
2. Using GPyOpt
3. Best Result and Ensembling
4. Results
5. Code

1. GPyOpt Python Package

GPyOpt is a Python open-source library for Bayesian Optimization developed by the Machine Learning group of the University of Sheffield. It is based on GPy, a Python framework for Gaussian process modelling.
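GPyOpt and GPy are both available on PyPI; installing GPyOpt (for example, with pip install gpyopt) pulls in GPy as a dependency. A quick sanity check that both libraries are importable:

import GPy
import GPyOpt

print('GPy and GPyOpt imported successfully')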

In this article, we demonstrate how to use this package to do hyperparameter search for a classification problem with Scikit-learn.

2. Using GPyOpt

Below are code fragments showing how to integrate the package with Scikit-learn.

We begin by specifying whether the problem is one of minimization or maximization. Here, we are again using the ExtraTreesClassifier on the MNIST handwritten digits data set, so we wish to maximize the balanced accuracy.

if error_metric in ['balanced_accuracy', 'accuracy']:
    maximize_bool = True
else:
    raise NameError

Then we set the search space as a list of dictionaries, one per hyperparameter, in which we name each variable, set its type to continuous or discrete, and specify its range of values.

search_space = [
            {'name':'max_features', 'type':'continuous', 'domain':(0.15, 1.0)},
            {'name':'max_samples', 'type':'continuous', 'domain':(0.6, 0.99)},
            {'name':'min_samples_split', 'type':'continuous', 'domain':(2, 14.49)},
            {'name':'min_samples_leaf', 'type':'continuous', 'domain':(1, 14.49)},
            {'name':'n_estimators', 'type':'discrete', 
             'domain':(25, 50, 75, 100, 125, 150, 175, 200, 225, 250)},
            {'name':'criterion', 'type':'discrete', 'domain':(1, 2)}]

Now we create our Scikit-learn model inside the objective function that GPyOpt will call.

def etrees_crossval(params):
        dict_criterion={1:'gini', 2:'entropy'}
        etrees = ExtraTreesClassifier(
                 max_features=params[0][0], 
                 max_samples=params[0][1],
                 min_samples_split=int(params[0][2]),
                 min_samples_leaf=int(params[0][3]), 
                 n_estimators=int(params[0][4]),
                 criterion=dict_criterion[int(params[0][5])],
                 bootstrap=True, n_jobs=-1, verbose=0)
         
        mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric, 
                                        cv=cv_folds, n_jobs=-1).mean()
         
        return mean_cv_score

params holds the hyperparameter values that GPyOpt passes to the objective function as a two-dimensional array, with one row per evaluation. Note that the second index must match the order of the variables specified in search_space.
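To make the indexing concrete, here is a small sketch (with hypothetical values, not taken from an actual run) of the array that GPyOpt passes to the objective function:

import numpy as np

# one row per evaluation; columns follow the order of search_space
params = np.array([[0.5, 0.8, 3.7, 2.2, 150.0, 1.0]])

max_features = params[0][0]                              # 0.5
max_samples = params[0][1]                               # 0.8
min_samples_split = int(params[0][2])                    # 3
min_samples_leaf = int(params[0][3])                     # 2
n_estimators = int(params[0][4])                         # 150
criterion = {1:'gini', 2:'entropy'}[int(params[0][5])]   # 'gini'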

We create our Bayesian optimization model.

gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=etrees_crossval, domain=search_space, 
                model_type='GP', initial_design_numdata=num_random_points, 
                initial_design_type='random', acquisition_type='EI', 
                normalize_Y=True, exact_feval=False, 
                acquisition_optimizer_type='lbfgs', 
                model_update_interval=1, evaluator_type='sequential', 
                batch_size=1, num_cores=os.cpu_count(), verbosity=True, 
                verbosity_model=False, maximize=maximize_bool, de_duplication=True)

We are using a Gaussian process surrogate with the expected improvement (EI) acquisition function, which is optimized with the limited-memory Broyden–Fletcher–Goldfarb–Shanno ('lbfgs') algorithm. See https://gpyopt.readthedocs.io/en/latest/GPyOpt.methods.html for information about all of the parameters as well as how to use surrogate models other than Gaussian processes.
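As a sketch of what such a change might look like (the argument values below follow the documentation but were not run for this article), the surrogate model and acquisition function can be swapped by changing the corresponding arguments, for example using a lower confidence bound (LCB) acquisition:

gpyopt_bo_lcb = GPyOpt.methods.BayesianOptimization(
                f=etrees_crossval, domain=search_space,
                model_type='GP', acquisition_type='LCB',
                initial_design_numdata=num_random_points,
                initial_design_type='random',
                exact_feval=False, maximize=maximize_bool)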

In the next step, we set up log files and run the optimization by specifying the number of trials.

# text file with parameter info, gp info, best results
rf = results_dir + 'report'

# text file with header: Iteration Y var_1 var_2 etc.
ef = results_dir + 'evaluation'

# text file with gp model info
mf = results_dir + 'models'
gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf, 
                           evaluations_file=ef, models_file=mf)

Finally, we create a Pandas DataFrame to store the hyperparameter values and the cross-validated balanced accuracy for each trial. Note the negative sign in front of the evaluation results returned by GPyOpt (since we are maximizing, it stores the negated objective) and that integer-valued hyperparameters are rendered as reals.

header_params = []
for param in search_space:
    header_params.append(param['name'])
    
df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

# since we maximize, gpyopt stores the negated score (e.g., -0.91 for 0.91);
# integer values are rendered as real (1 -> 1.0)
df_results[error_metric] = -gpyopt_bo.Y
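A convenient way to inspect the best trials (not part of the original fragment) is to sort this DataFrame by the score column:

df_top = df_results.sort_values(by=error_metric, ascending=False).head(10)
print(df_top)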

3. Best Result and Ensembling

To obtain an ensemble, we apply a threshold to the cross-validated balanced accuracy and then refit ExtraTreesClassifier, with each accepted set of hyperparameters, on the entire calibration data set. Then we input the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are averaged across the accepted models, and the final predicted class is the one with the highest mean probability. We also limit the number of models used to build the ensemble.
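A small numerical illustration (with made-up probabilities) of how the averaged probabilities determine the ensemble's predicted class:

import numpy as np

# hypothetical class probabilities from two models for three samples and three classes
probs_model_1 = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3],
                          [0.1, 0.2, 0.7]])
probs_model_2 = np.array([[0.4, 0.4, 0.2],
                          [0.1, 0.3, 0.6],
                          [0.2, 0.2, 0.6]])

mean_probabilities = np.mean([probs_model_1, probs_model_2], axis=0)
predicted_classes = np.argmax(mean_probabilities, axis=1)
print(predicted_classes)  # [0 2 2]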

To obtain the single best result, we sort the DataFrame mentioned in the previous section and use the hyperparameters that yielded the best cross-validated score.

if type_error in ['balanced_accuracy', 'accuracy']:
    threshold_multiplier = 1.0 
    df_params.sort_values(by=type_error, ascending=False, inplace=True)
else:
    raise NameError
 
# apply threshold
dict_criterion={1:'gini', 2:'entropy'}
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    bool1 = df_params.loc[df_params.index[i],type_error] > threshold_multiplier*threshold
    bool2 = accepted_models < accepted_models_num
    if bool1 and bool2:
        
        criterion_int = int(df_params.loc[df_params.index[i],'criterion'])
        
        ml_model = ExtraTreesClassifier(
        n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
        max_features=df_params.loc[df_params.index[i],'max_features'], 
        min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
        min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']), 
        max_samples=df_params.loc[df_params.index[i],'max_samples'],
        criterion= dict_criterion[criterion_int],
        bootstrap=True, n_jobs=-1, verbose=0)
         
        ml_model.fit(xcalib, ycalib)
         
        list_predicted_prob.append(ml_model.predict_proba(xprod))
        
        # best result
        if i == 0:
            y_predicted_class_best = ml_model.predict(xprod)
            
            name_result = ml_name + '_best'
            classification_analysis.classification_analysis(
            yprod, y_predicted_class_best, list_class_names,
            save_directory, name_result)
         
        accepted_models = accepted_models + 1
         
        if save_models_flag:
            model_name = ml_name + str(df_params.index[i]) + '_joblib.sav'
            dump(ml_model, save_directory + model_name)
 
print('\nthere were', accepted_models, 'accepted models\n')

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)
 
# compute predicted class
# argmax uses 1st occurrence in case of a tie
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
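The fragment above stops at the ensemble's predicted classes; a minimal sketch of how those predictions could be scored against the true production labels with scikit-learn metrics (the article's own classification_analysis module handles the detailed reporting and is not shown here):

from sklearn.metrics import accuracy_score, balanced_accuracy_score

print('ensemble balanced accuracy =',
      balanced_accuracy_score(yprod, y_predicted_class_ensemble))
print('ensemble accuracy =',
      accuracy_score(yprod, y_predicted_class_ensemble))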

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data.
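The calibration and production sets are loaded from .npy files in the code of Section 5. As a rough sketch of how such a split could be produced (an assumption, since the article does not show its data preparation), one could fetch MNIST from OpenML and keep half of the 60,000 training images as the calibration set:

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

x_all, y_all = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y_all = y_all.astype(int)

# first 60,000 images are the standard training set, the last 10,000 the test set
x_train, y_train = x_all[:60000], y_all[:60000]
x_prod, y_prod = x_all[60000:], y_all[60000:]

# keep half of the training data as the calibration set
x_calib, _, y_calib, _ = train_test_split(x_train, y_train, train_size=0.5,
                                          stratify=y_train, random_state=13)

np.save('x_mnist_calibration_1.npy', x_calib)
np.save('y_mnist_calibration_1.npy', y_calib)
np.save('x_mnist_production.npy', x_prod)
np.save('y_mnist_production.npy', y_prod)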

Top 10 trials, sorted by cross-validated balanced accuracy (criterion: 1 = gini, 2 = entropy):

max_features  max_samples  min_samples_split  min_samples_leaf  n_estimators  criterion  balanced_accuracy
0.150         0.990        2.000              1.000             225           1          0.9520
0.150         0.990        2.000              1.000             250           1          0.9515
0.150         0.990        2.000              1.000             175           1          0.9515
0.150         0.990        2.000              1.000             200           1          0.9514
0.150         0.990        2.005              1.000             225           1          0.9507
0.169         0.989        2.709              1.174             175           1          0.9506
0.192         0.944        3.878              1.985             250           1          0.9505
0.167         0.945        4.337              1.172             225           1          0.9505
0.150         0.990        2.000              1.000             225           2          0.9503
0.350         0.970        3.918              1.096             200           1          0.9499

Best results:
balanced accuracy score = 0.9607
accuracy score = 0.961


Ensemble results:
balanced accuracy score = 0.961
accuracy score = 0.9613

5. Code

import os
import numpy as np
import pandas as pd
import GPyOpt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump
import time
from pathlib import Path

# the author's own helper module, used in make_final_predictions
# to save classification reports
import classification_analysis

# *******************************************************************
     
def create_save_models(x, y, error_metric, cv_folds,
    num_random_points, num_iterations, results_dir):
    
    if error_metric in ['balanced_accuracy', 'accuracy']:
        maximize_bool = True
    else:
        raise NameError
     
    start_time_total = time.time()
    
    search_space = [
            {'name':'max_features', 'type':'continuous', 'domain':(0.15, 1.0)},
            {'name':'max_samples', 'type':'continuous', 'domain':(0.6, 0.99)},
            {'name':'min_samples_split', 'type':'continuous', 'domain':(2, 14.49)},
            {'name':'min_samples_leaf', 'type':'continuous', 'domain':(1, 14.49)},
            {'name':'n_estimators', 'type':'discrete', 
             'domain':(25, 50, 75, 100, 125, 150, 175, 200, 225, 250)},
            {'name':'criterion', 'type':'discrete', 'domain':(1, 2)}]
        
    # create sklearn model
    def etrees_crossval(params):
        dict_criterion={1:'gini', 2:'entropy'}
        etrees = ExtraTreesClassifier(
                 max_features=params[0][0], 
                 max_samples=params[0][1],
                 min_samples_split=int(params[0][2]),
                 min_samples_leaf=int(params[0][3]), 
                 n_estimators=int(params[0][4]),
                 criterion=dict_criterion[int(params[0][5])],
                 bootstrap=True, n_jobs=-1, verbose=0)
         
        mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric, 
                                        cv=cv_folds, n_jobs=-1).mean()
         
        return mean_cv_score
    
    gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=etrees_crossval, domain=search_space, 
                model_type='GP', initial_design_numdata=num_random_points, 
                initial_design_type='random', acquisition_type='EI', 
                normalize_Y=True, exact_feval=False, 
                acquisition_optimizer_type='lbfgs', 
                model_update_interval=1, evaluator_type='sequential', 
                batch_size=1, num_cores=os.cpu_count(), verbosity=True, 
                verbosity_model=False, maximize=maximize_bool, de_duplication=True)
    
    # text file with parameter info, gp info, best results
    rf = results_dir + 'report'
    
    # text file with header: Iteration	Y var_1 var_2 etc.
    ef = results_dir + 'evaluation'
    
    # text file with gp model info
    mf = results_dir + 'models'
    gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf, 
                               evaluations_file=ef, models_file=mf)
        
        
    # put results in dfs
    header_params = []
    for param in search_space:
        header_params.append(param['name'])
        
    df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)
    
    # since we maximize, gpyopt stores the negated score (e.g., -0.91 for 0.91);
    # integer values are rendered as real (1 -> 1.0)
    df_results[error_metric] = -gpyopt_bo.Y
    df_results.to_pickle(results_dir + 'df_results.pkl')
    df_results.to_csv(results_dir + 'df_results.csv')
    
     
    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
     
         
# end of create_save_models()
         
# *******************************************************************

def make_final_predictions(xcalib, ycalib, xprod, yprod,
                           save_directory, save_models_flag, df_params,
                           threshold, type_error, accepted_models_num,
                           ml_name, list_class_names):
    
    if type_error in ['balanced_accuracy', 'accuracy']:
        threshold_multiplier = 1.0 
        df_params.sort_values(by=type_error, ascending=False, inplace=True)
    else:
        raise NameError
     
    # apply threshold
    dict_criterion={1:'gini', 2:'entropy'}
    accepted_models = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        bool1 = df_params.loc[df_params.index[i],type_error] > threshold_multiplier*threshold
        bool2 = accepted_models < accepted_models_num
        if bool1 and bool2:
            
            criterion_int = int(df_params.loc[df_params.index[i],'criterion'])
            
            ml_model = ExtraTreesClassifier(
            n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
            max_features=df_params.loc[df_params.index[i],'max_features'], 
            min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
            min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']), 
            max_samples=df_params.loc[df_params.index[i],'max_samples'],
            criterion= dict_criterion[criterion_int],
            bootstrap=True, n_jobs=-1, verbose=0)
             
            ml_model.fit(xcalib, ycalib)
             
            list_predicted_prob.append(ml_model.predict_proba(xprod))
            
            # best results
            if i == 0:
                y_predicted_class_best = ml_model.predict(xprod)
                
                name_result = ml_name + '_best'
                classification_analysis.classification_analysis(
                yprod, y_predicted_class_best, list_class_names,
                save_directory, name_result)
             
            accepted_models = accepted_models + 1
             
            if save_models_flag:
                model_name = ml_name + str(df_params.index[i]) + '_joblib.sav'
                dump(ml_model, save_directory + model_name)
 
    print('\nthere were', accepted_models, 'accepted models\n')
    
    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
     
    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
    
# end of make_final_predictions()
    
# *******************************************************************
    
if __name__ == '__main__':    
     
    ml_algorithm_name = 'etrees'
    file_name_stub = 'gpyopt_' + ml_algorithm_name
     
    calculation_type = 'production'
     
    base_directory = 'YOUR DIRECTORY'  # replace with your own path (ending in '/')
    data_directory = 'YOUR DIRECTORY'  # replace with your own path (ending in '/')
     
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
                 
    # fixed parameters
    error_type = 'balanced_accuracy'
    threshold_error = 0.93
    cross_valid_folds = 3
    total_number_of_iterations = 100
    number_of_random_points = 20 # random searches to start opt process
    number_of_iterations =  total_number_of_iterations - number_of_random_points
    save_models = False
    num_models_to_accept = 10
            
    # use small data set
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
 
    # 1 - calibration
    if calculation_type == 'calibration':
         
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
         
        create_save_models(x_calib, y_calib, error_type, 
                                     cross_valid_folds, 
                                     number_of_random_points, number_of_iterations,
                                     results_directory) 
    # 2 - production - apply threshold
    elif calculation_type == 'production':        
        # get etrees parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_results.pkl')
         
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
             
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
         
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                 
        make_final_predictions(x_calib, y_calib, x_prod, y_prod,
                               results_directory, save_models, df_parameters, 
                               threshold_error, error_type, num_models_to_accept,
                               ml_algorithm_name, class_names_list)                
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError