1. GPyOpt Python Package
2. Using GPyOpt
3. Best Result and Ensembling
4. Results
5. Code

1. GPyOpt Python Package

GPyOpt is a Python open-source library for Bayesian Optimization developed by the Machine Learning group of the University of Sheffield. It is based on GPy, a Python framework for Gaussian process modelling.

In this article, we demonstrate how to use this package to perform hyperparameter search for a classification problem with XGBoost.

2. Using GPyOpt

Below are code fragments showing how to integrate the package with XGBoost.

We begin by specifying whether the problem is one of minimization or maximization. Here, we are again using the MNIST handwritten digits data set for our classification problem, so we wish to minimize the multiclass log loss.

if error_metric == 'mlogloss':
    maximize_bool = False
else:
    raise NameError

Then we define the search space as a list of dictionaries, one per hyperparameter, in which we name each variable, set its type to continuous or discrete, and specify its range of values.

search_space = [
        {'name':'eta', 'type':'continuous', 'domain':(0.4, 0.8)},
        {'name':'max_depth', 'type':'continuous', 'domain':(3, 21)},
        {'name':'colsample_bytree', 'type':'continuous', 'domain':(0.5, 0.99)},
        {'name':'reg_lambda', 'type':'discrete', 'domain':(1,2,3,4,5,6)},
        {'name':'reg_alpha', 'type':'discrete', 'domain':(0,1,2,3)},
        {'name':'subsample', 'type':'continuous', 'domain':(0.6, 0.9)},
        {'name':'max_bin', 'type':'discrete',
         'domain':(50,75,100,125,150,200,250)}]

We also set fixed parameters required for XGBoost.

dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_metric,  # 'mlogloss',
                          'num_class':number_of_classes}

maximum_boosting_rounds = 1500
early_stop_rounds = 10

Now we define the objective function that GPyOpt will call on each trial. It trains an XGBoost model with the candidate hyperparameters, evaluates it on the validation set with early stopping, and returns the multiclass log loss.

def xgb_function(params):
    
    dict_parameters = {'eta':params[0][0], 'max_depth':int(params[0][1]), 
    'colsample_bytree':params[0][2], 'reg_lambda':int(params[0][3]), 
    'reg_alpha':int(params[0][4]), 'subsample':params[0][5], 
    'max_bin':int(params[0][6])}
    
    dict_parameters.update(dict_single_parameters)        
    
    watchlist = [(dvalid, 'eval')]  
    
    xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                          num_boost_round=maximum_boosting_rounds,
                          early_stopping_rounds=early_stop_rounds,
                          verbose_eval=False)
    
    model_name = 'xgb_gpyopt_' + str(np.random.uniform(0,1,))[2:9]
    multiclass_log_loss = xgb_model.best_score
    boosting_rounds = xgb_model.best_ntree_limit
    
    print('\nmodel name:',model_name)
    print('multiclass_log_loss =',multiclass_log_loss)
    print('boosting_rounds = ',boosting_rounds)
    
    list_boosting_rounds.append(boosting_rounds)
    list_model_name.append(model_name)
    
    # save model
    if xgb_model.best_ntree_limit <= maximum_boosting_rounds:
        joblib.dump(xgb_model, results_dir + model_name + '.joblib')
    
    return multiclass_log_loss

Here, params is a two-dimensional array supplied by GPyOpt, containing one candidate point per row. Its columns must be read in the same order in which the variables were specified in search_space.
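
As an illustration (with made-up values, not a point actually proposed by the optimizer), here is how a row of params maps onto the named hyperparameters:

# hypothetical candidate point as GPyOpt would pass it:
# a 2D array with one row per point, columns in search_space order
example_params = np.array([[0.55, 12.3, 0.75, 3.0, 1.0, 0.8, 100.0]])
names = [entry['name'] for entry in search_space]
print(dict(zip(names, example_params[0])))
# {'eta': 0.55, 'max_depth': 12.3, 'colsample_bytree': 0.75, ...}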

We then create our Bayesian optimization model.

gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=xgb_function, domain=search_space, 
                model_type='GP', initial_design_numdata=num_random_points, 
                initial_design_type='random', acquisition_type='EI', 
                normalize_Y=True, exact_feval=False, 
                acquisition_optimizer_type='lbfgs', 
                model_update_interval=1, evaluator_type='sequential', 
                batch_size=1, num_cores=os.cpu_count(), verbosity=True, 
                verbosity_model=False, maximize=maximize_bool, de_duplication=True)

We are using a Gaussian process surrogate with the expected improvement (EI) acquisition function, which is optimized with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs) algorithm. See https://gpyopt.readthedocs.io/en/latest/GPyOpt.methods.html for information about all of the parameters, as well as how to use surrogate models other than Gaussian processes.
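
As one example of that flexibility (these settings are illustrative, not the configuration used in this article), the same call can instead marginalize the Gaussian process hyperparameters with MCMC by pairing the GP_MCMC surrogate with its matching acquisition:

# sketch: same objective and search space, but an MCMC-marginalized GP;
# in GPyOpt, 'GP_MCMC' is paired with an '*_MCMC' acquisition type
gpyopt_bo_mcmc = GPyOpt.methods.BayesianOptimization(
    f=xgb_function, domain=search_space,
    model_type='GP_MCMC', acquisition_type='EI_MCMC',
    initial_design_numdata=num_random_points,
    maximize=maximize_bool, de_duplication=True)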

In the next step, we set up the log files and run the optimization, specifying the number of Bayesian optimization iterations (the initial random points are counted separately).

# text file with parameter info, gp info, best results
rf = results_dir + 'report'

# text file with header: Iteration Y var_1 var_2 etc.
ef = results_dir + 'evaluation'

# text file with gp model info
mf = results_dir + 'models'
gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf, 
                           evaluations_file=ef, models_file=mf)

Finally, we create a Pandas DataFrame to store the hyperparameter values, loss, boosting rounds, and model name for each trial. Note that integer values are rendered as reals.

header_params = []
for param in search_space:
    header_params.append(param['name'])
    
df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

# integer values are rendered as real (1 -> 1.0)
df_results[error_metric] = gpyopt_bo.Y
df_results['boosting rounds'] = list_boosting_rounds
df_results['model name'] = list_model_name
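
GPyOpt also exposes the best point it found directly on the optimizer object; a small sketch (not part of the original script) of reading it back:

# best hyperparameter vector (columns in search_space order) and its loss
best_x = gpyopt_bo.x_opt
best_loss = gpyopt_bo.fx_opt
print(dict(zip(header_params, best_x)), best_loss)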

 

3. Best Result and Ensembling

To obtain an ensemble, we load the saved models, sort them by multiclass log loss, and apply a threshold so that only the top 5 models are used. Then we feed in the production data (unseen data) to obtain class probabilities, NOT the predicted classes themselves. These probabilities are averaged across the accepted models, and the final predicted class is the one with the largest mean probability (soft voting).

To obtain the single best result, we sort the DataFrame described in the previous section and take the hyperparameters that yielded the lowest multiclass log loss.

if type_error == 'mlogloss':
    threshold_multiplier = 1.0 
    df_params.sort_values(by=type_error, ascending=True, inplace=True)
else:
    raise NameError
 
# apply threshold
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    bool1 = df_params.loc[df_params.index[i],type_error] < threshold_multiplier*threshold
    bool2 = accepted_models < accepted_models_num
    if bool1 and bool2:            
        
        model_name = str(df_params.loc[df_params.index[i],'model name'])
         
        try:
            xgb_model = joblib.load(models_directory + model_name + '.joblib')
        except:
            print('\ncould not read', model_name)
        else:
            # these are class probabilities
            list_predicted_prob.append(xgb_model.predict(dprod))
            
            # best result
            if i == 0:
                y_predicted_class_best = np.argmax(xgb_model.predict(dprod), axis=1)
                
                name_result = ml_name + '_best'
                classification_analysis.classification_analysis(
                yprod, y_predicted_class_best, list_class_names,
                save_directory, name_result)
                            
            accepted_models = accepted_models + 1

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses the first occurrence in case of a tie
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
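
The scores reported in the next section are balanced accuracy and accuracy. In the full code they are produced by classification_analysis, the author's own reporting helper; a minimal, assumed-equivalent sketch with scikit-learn:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

# score the ensemble predictions against the true production labels
print('balanced accuracy score =', balanced_accuracy_score(yprod, y_predicted_class_ensemble))
print('accuracy score =', accuracy_score(yprod, y_predicted_class_ensemble))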

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data.

Best 5 trials (sorted by multiclass log loss):

eta     max_depth  colsample_bytree  reg_lambda  reg_alpha  subsample  max_bin  mlogloss  boosting rounds
0.4000  13.5529    0.5000            6           0          0.6000     75       0.078332  270
0.4000  13.0877    0.5000            6           0          0.6000     75       0.078332  270
0.4000  11.7554    0.9900            6           0          0.6000     75       0.078637  292
0.4567  14.9832    0.7403            6           0          0.6996     75       0.078739  191
0.4448  16.4082    0.6652            6           0          0.6976     75       0.079514  208

Best single model results:
balanced accuracy score = 0.9759
accuracy score = 0.9761


Ensemble results:
balanced accuracy score = 0.978
accuracy score = 0.9782

5. Code

import os
import numpy as np
import pandas as pd
import GPyOpt
import xgboost as xgb
from sklearn.model_selection import train_test_split
import pickle
import joblib
import time
from pathlib import Path
import classification_analysis  # author's own reporting module, used in make_final_predictions() (not shown here)
    
# *******************************************************************
        
def create_save_models(dtrain, dvalid, error_metric, 
                       num_random_points, num_iterations, 
                       results_dir, processor_type, number_of_classes):
    
    dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                              'objective':'multi:softprob',
                              'eval_metric':error_metric,
                              'num_class':number_of_classes}
    
    if processor_type == 'cpu':
        tree_method  = 'hist'
        predictor = 'auto'
    elif processor_type == 'gpu':
        tree_method  = 'gpu_hist'
        predictor = 'gpu_predictor'
    else:
        print('\ninvalid processor_type in create_models():',processor_type)
        raise NameError
        
    dict_single_parameters['tree_method'] = tree_method
    dict_single_parameters['predictor'] = predictor
    
    # save to results dir
    dict_file = open(results_dir + 'dict_single_parameters.pkl','wb')
    pickle.dump(dict_single_parameters, dict_file)
       
    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    list_boosting_rounds = []
    list_model_name = []
    
    # create search space
    search_space = [
            {'name':'eta', 'type':'continuous', 'domain':(0.4, 0.8)},
            {'name':'max_depth', 'type':'continuous', 'domain':(3, 21)},
            {'name':'colsample_bytree', 'type':'continuous', 'domain':(0.5, 0.99)},
            {'name':'reg_lambda', 'type':'discrete', 'domain':(1,2,3,4,5,6)},
            {'name':'reg_alpha', 'type':'discrete', 'domain':(0,1,2,3)},
            {'name':'subsample', 'type':'continuous', 'domain':(0.6, 0.9)},
            {'name':'max_bin', 'type':'discrete',
             'domain':(50,75,100,125,150,200,250)}]
            
    if error_metric == 'mlogloss':
        maximize_bool = False
    else:
        raise NameError
        
    start_time_total = time.time()
    
    def xgb_function(params):
        
        dict_parameters = {'eta':params[0][0], 'max_depth':int(params[0][1]), 
        'colsample_bytree':params[0][2], 'reg_lambda':int(params[0][3]), 
        'reg_alpha':int(params[0][4]), 'subsample':params[0][5], 
        'max_bin':int(params[0][6])}
        
        dict_parameters.update(dict_single_parameters)        
        
        watchlist = [(dvalid, 'eval')]  
        
        xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                              num_boost_round=maximum_boosting_rounds,
                              early_stopping_rounds=early_stop_rounds,
                              verbose_eval=False)
        
        model_name = 'xgb_gpyopt_' + str(np.random.uniform(0,1,))[2:9]
        multiclass_log_loss = xgb_model.best_score
        boosting_rounds = xgb_model.best_ntree_limit
        
        print('\nmodel name:',model_name)
        print('multiclass_log_loss =',multiclass_log_loss)
        print('boosting_rounds = ',boosting_rounds)
        
        list_boosting_rounds.append(boosting_rounds)
        list_model_name.append(model_name)
        
        # save model
        if xgb_model.best_ntree_limit <= maximum_boosting_rounds:
            joblib.dump(xgb_model, results_dir + model_name + '.joblib')
        
        return multiclass_log_loss
    
    gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=xgb_function, domain=search_space, 
                model_type='GP', initial_design_numdata=num_random_points, 
                initial_design_type='random', acquisition_type='EI', 
                normalize_Y=True, exact_feval=False, 
                acquisition_optimizer_type='lbfgs', 
                model_update_interval=1, evaluator_type='sequential', 
                batch_size=1, num_cores=os.cpu_count(), verbosity=False, 
                verbosity_model=False, maximize=maximize_bool, de_duplication=True)
    
    # text file with parameter info, gp info, best results
    rf = results_dir + 'report'
    
    # text file with header: Iteration	Y	var_1	var_2	etc.
    ef = results_dir + 'evaluation'
    
    # text file with gp model info
    mf = results_dir + 'models'
    gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf, 
                               evaluations_file=ef, models_file=mf)
    
    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
               
    # put results in dfs
    header_params = []
    for param in search_space:
        header_params.append(param['name'])
        
    df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)
    
    # integer values are rendered as real (1 -> 1.0)
    df_results[error_metric] = gpyopt_bo.Y
    df_results['boosting rounds'] = list_boosting_rounds
    df_results['model name'] = list_model_name
    df_results.to_pickle(results_dir + 'df_results.pkl')
    df_results.to_csv(results_dir + 'df_results.csv')
               
# end of create_save_models()
    
# *******************************************************************
            
def make_final_predictions(dcalib, dprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, ml_name, dict_single_params,
                           type_error, accepted_models_num):
    
    if type_error == 'mlogloss':
        threshold_multiplier = 1.0 
        df_params.sort_values(by=type_error, ascending=True, inplace=True)
    else:
        raise NameError
    
    # apply threshold
    accepted_models = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        bool1 = df_params.loc[df_params.index[i],type_error] < threshold_multiplier*threshold
        bool2 = accepted_models < accepted_models_num
        if bool1 and bool2:            
            
            model_name = str(df_params.loc[df_params.index[i],'model name'])
             
            try:
                xgb_model = joblib.load(models_directory + model_name + '.joblib')
            except:
                print('\ncould not read', model_name)
            else:
                # these are class probabilities
                list_predicted_prob.append(xgb_model.predict(dprod))
                
                # best result
                if i == 0:
                    y_predicted_class_best = np.argmax(xgb_model.predict(dprod), axis=1)
                    
                    name_result = ml_name + '_best'
                    classification_analysis.classification_analysis(
                    yprod, y_predicted_class_best, list_class_names,
                    save_directory, name_result)
                                    
                accepted_models = accepted_models + 1

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses the first occurrence in case of a tie
    y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
    
# end of make_final_predictions()
       
# *******************************************************************

if __name__ == '__main__':    
    
    type_of_processor = 'gpu'
           
    ml_algorithm_name = 'xgb'
    file_name_stub = 'gpyopt_' + ml_algorithm_name
     
    calculation_type = 'calibration' #'calibration' 'production'
     
    base_directory = YOUR DIRECTORY
    data_directory = YOUR DIRECTORY
     
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
               
    # fixed parameters
    error_type = 'mlogloss'
    threshold_error = 0.09
    total_number_of_iterations = 25
    number_of_random_points = 5  # random searches to start opt process
    # this is # of bayes iters, thus total=this + # of random pts
    number_of_iterations =  total_number_of_iterations - number_of_random_points
    save_models = False
    num_models_to_accept = 5
          
    # read data
    x_calib = np.load(data_directory + 'x_mnist_calibration.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration.npy')
    d_calib = xgb.DMatrix(x_calib, label=y_calib)
    num_classes = np.unique(y_calib).shape[0]
    
    x_train, x_valid, y_train, y_valid = train_test_split(x_calib ,y_calib,
    train_size=0.75, shuffle=True, stratify=y_calib)
    
    # transform numpy arrays into dmatrix format
    d_train = xgb.DMatrix(x_train, label=y_train)
    d_valid = xgb.DMatrix(x_valid, label=y_valid)
                
    print('\n*** starting at',pd.Timestamp.now())

    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
        
        create_save_models(d_train, d_valid, error_type,
                           number_of_random_points, number_of_iterations,
                           results_directory, type_of_processor, num_classes)
                          
    elif calculation_type == 'production':
        
        # get xgboost parameters saved during calibration
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_results.pkl')
        
        dict_file = open(models_dir + 'dict_single_parameters.pkl','rb')
        dictionary_single_parameters = pickle.load(dict_file)
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
    
        # transform numpy arrays into dmatrix format
        d_prod = xgb.DMatrix(x_prod, label=y_prod)
        d_calib = xgb.DMatrix(x_calib, label=y_calib)
                
        make_final_predictions(d_calib, d_prod, y_prod, class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, ml_algorithm_name, 
                               dictionary_single_parameters,
                               error_type, num_models_to_accept)
               
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError