1. Introduction
  2. Using Bayesian Optimization
  3. Ensembling
  4. Results
  5. Code

1. Introduction

In Hyperparameter Search With Bayesian Optimization for Scikit-learn Classification and Ensembling we applied the Bayesian Optimization (BO) package to the Scikit-learn ExtraTreesClassifier algorithm. Here we do the same for XGBoost. Because we are using the native (non-Scikit-learn) version of XGBoost, some modifications to the previous code are required, rather than a straightforward drop-in of algorithm-specific parameters. Once that is done, however, we can access the full power of XGBoost running on GPUs with an efficient hyperparameter search method.

2. Using Bayesian Optimization

Below is a code fragment showing how to integrate the package with XGBoost.

# xgb single params
dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_metric,  # 'mlogloss',
                          'num_class':number_of_classes}

if processor_type == 'cpu':
    tree_method  = 'hist' # auto
    predictor = 'auto'  # 'cpu_predictor' , default = auto
elif processor_type == 'gpu':
    tree_method  = 'gpu_hist'
    predictor = 'gpu_predictor' # default = auto
else:
    print('\ninvalid processor_type in create_models():',processor_type)
    raise NameError
    
dict_single_parameters['tree_method'] = tree_method
dict_single_parameters['predictor'] = predictor

# save to results dir
dict_file = open(results_dir + 'dict_single_parameters.pkl','wb')
pickle.dump(dict_single_parameters, dict_file)
dict_file.close()

maximum_boosting_rounds = 1500
early_stop_rounds = 10
list_boosting_rounds = []

start_time_total = time.time()

def xgb_function(eta, max_depth, colsample_bytree, reg_lambda, reg_alpha,
                 subsample, max_bin):
    
    dict_parameters = {'eta':eta, 'max_depth':int(max_depth), 
    'colsample_bytree':colsample_bytree, 'reg_lambda':int(reg_lambda), 
    'reg_alpha':int(reg_alpha), 'subsample':subsample, 'max_bin':int(max_bin)}
    
    dict_parameters.update(dict_single_parameters)        
    
    watchlist = [(dvalid, 'eval')]
    xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                          num_boost_round=maximum_boosting_rounds,
                          early_stopping_rounds=early_stop_rounds,
                          verbose_eval=False)
    
    multiclass_log_loss = xgb_model.best_score
    boosting_rounds = xgb_model.best_ntree_limit
    
    # record boosting_rounds here, added to bayes opt parameter df below
    list_boosting_rounds.append(boosting_rounds)
    
    # bayes opt is a maximization algorithm, to minimize mlogloss, return 1-this
    bayes_opt_score = 1.0 - multiclass_log_loss
    
    return bayes_opt_score

optimizer = BayesianOptimization(f=xgb_function,
            pbounds={'eta': (0.1, 0.8),
                     'max_depth': (3, 21),
                     'colsample_bytree': (0.5, 0.99),
                     'reg_lambda': (1, 6),
                     'reg_alpha': (0, 3),
                     'subsample': (0.6, 0.9),
                     'max_bin': (32, 257)},
                     verbose=2)

optimizer.maximize(init_points=num_random_points, 
                   n_iter=num_iterations)

print('\nbest result:', optimizer.max)

Calling the XGBoost train function requires various parameters. We must decide which parameters will be exposed to BO, which will be fixed, and, of the fixed ones, which will be needed later when creating a model from all of the calibration data. Thus we create a dictionary of fixed parameters and save it for later use. The variable parameters are placed in a dictionary that is used in the train function call. Note that BO only supports real-valued parameters, so we must cast the integer-valued parameters to integers ourselves. Also note that BO only sees the parameters that appear in the pbounds dictionary.

As this is a multiclass problem, we use the mlogloss (multiclass logarithmic loss) evaluation metric. Since lower is better and BO is a maximization algorithm, we record the BO score as 1 – mlogloss. Additionally, since we want to ensemble by first averaging predicted class probabilities and then obtaining a definite class prediction via plurality voting, we set the XGBoost objective to multi:softprob instead of multi:softmax. When we call predict, we then get predicted class probabilities; with multi:softmax, we would get definite classes instead.
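
To make the distinction concrete, below is a minimal sketch on synthetic data (illustrative values only, not the settings used in this article) showing the shape of predictions under multi:softprob.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 5))
y_demo = rng.integers(0, 3, size=300)
d_demo = xgb.DMatrix(x_demo, label=y_demo)

# multi:softprob yields an (n_samples, n_classes) array of probabilities;
# multi:softmax would instead yield the predicted class labels directly
params_demo = {'objective':'multi:softprob', 'num_class':3, 'eta':0.3}
booster_demo = xgb.train(params=params_demo, dtrain=d_demo, num_boost_round=10)

probabilities = booster_demo.predict(d_demo)
print(probabilities.shape)        # (300, 3)
print(probabilities[0].sum())     # each row sums to 1
print(np.argmax(probabilities, axis=1)[:5])  # definite classes, if needed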

Because we use early stopping, we must record the actual number of boosting rounds used for each set of BO parameters. We do this inside the objective function, which BO calls once per iteration (an implicit loop), accumulating the results in a list. Later, we save this list along with the other BO parameters.

See Hyperparameter Search With Bayesian Optimization for Scikit-learn Classification and Ensembling for an explanation of the other BO parameters.

To collect all results from BO, we have

list_dfs = []
counter = 0
for result in optimizer.res:
    df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                     columns=['trial' + str(counter)]).T
    df_temp['bayes opt error'] = result['target']
    
    df_temp['boosting rounds'] = list_boosting_rounds[counter]
    
    list_dfs.append(df_temp)
    
    counter = counter + 1
    
df_results = pd.concat(list_dfs, axis=0)
df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')

optimizer.res is a list of dictionaries. The code above puts each dictionary into a DataFrame and then stacks them vertically into a single DataFrame. optimizer.max yields the best result (maximum target); it can also be obtained from this DataFrame. Note that we also save the actual number of boosting rounds.
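
As a quick check, here is a minimal sketch (assuming the df_results DataFrame built above) of recovering the best trial from the DataFrame, equivalent to optimizer.max:

# index label of the row with the maximum BO score, e.g. 'trial32'
best_trial = df_results['bayes opt error'].idxmax()
print('best trial:', best_trial)
print(df_results.loc[best_trial])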

3. Ensembling

To obtain an ensemble, we apply a threshold to the BO error (1 – multiclass log loss) and refit XGBoost on the entire calibration data set for each accepted trial. Then we feed in the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are averaged and the final predicted class is determined via plurality voting.

accepted_models_num = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    if df_params.loc[df_params.index[i],'bayes opt error'] > threshold:
        
        dict_temp = {'eta':df_params.loc[df_params.index[i],'eta'], 
        'max_depth':int(df_params.loc[df_params.index[i],'max_depth']), 
        'colsample_bytree':df_params.loc[df_params.index[i],'colsample_bytree'], 
        'reg_lambda':int(df_params.loc[df_params.index[i],'reg_lambda']), 
        'reg_alpha':int(df_params.loc[df_params.index[i],'reg_alpha']), 
        'subsample':df_params.loc[df_params.index[i],'subsample'], 
        'max_bin':int(df_params.loc[df_params.index[i],'max_bin'])}
        
        dict_temp.update(dict_single_params)       
                
        ml_model = xgb.train(params=dict_temp, dtrain=dcalib,
                   num_boost_round=df_params.loc[df_params.index[i],'boosting rounds'],
                   verbose_eval=False)
        
        # these are class probabilities
        list_predicted_prob.append(ml_model.predict(dprod))
        
        accepted_models_num = accepted_models_num + 1
        
        if save_models_flag:
            model_name = ml_name + df_params.index[i] + '.joblib'
            joblib.dump(ml_model, save_directory + model_name)

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st occurrence in case of a tie
y_predicted_class = np.argmax(mean_probabilities, axis=1)

Note that the fixed XGBoost parameters are read from the saved dictionary mentioned in the previous section. Also, we must remember to convert the real values supplied by BO back to integers for the integer-valued parameters.
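
As a small illustration of the averaging and voting steps, here are made-up probabilities for two hypothetical models, three samples, and three classes:

import numpy as np

model_a_prob = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.4, 0.4, 0.2]])
model_b_prob = np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.4, 0.4, 0.2]])

# mean over models, then argmax over classes
mean_prob = np.mean([model_a_prob, model_b_prob], axis=0)
print(np.argmax(mean_prob, axis=1))  # [0 1 0]; the tie in the 3rd row goes to class 0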

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we set the threshold to a value such that only the top 3 models were used in the ensemble.
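
One way to set such a threshold programmatically (a sketch, assuming the df_results DataFrame saved during calibration and no exact ties at the cutoff) is to use the error of the 4th-best trial, so that the strict > comparison in make_final_predictions() keeps exactly the top 3 models:

top_k = 3
sorted_errors = df_results['bayes opt error'].sort_values(ascending=False)
threshold_error = sorted_errors.iloc[top_k]  # error of the 4th-best trial
print('threshold =', threshold_error)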

XGBoost Hyperparameters

| trial | colsample_bytree | eta | max_bin | max_depth | reg_alpha | reg_lambda | subsample | bayes opt error | boosting rounds |
|---|---|---|---|---|---|---|---|---|---|
| trial32 | 0.5036 | 0.1143 | 138.7366 | 4.3728 | 0.1127 | 1.1435 | 0.7542 | 0.9358 | 5 |
| trial25 | 0.6243 | 0.1208 | 165.0408 | 4.1827 | 0.3094 | 1.1087 | 0.756 | 0.9287 | 556 |
| trial28 | 0.5204 | 0.1223 | 219.8433 | 4.4607 | 0.1138 | 5.9987 | 0.6937 | 0.9285 | 694 |
| trial33 | 0.5078 | 0.108 | 222.0052 | 9.6904 | 0.2431 | 5.7352 | 0.6508 | 0.9283 | 471 |
| trial23 | 0.5015 | 0.1057 | 75.8412 | 20.199 | 0.1203 | 5.9917 | 0.7483 | 0.9279 | 634 |
| trial24 | 0.8885 | 0.2177 | 34.6693 | 4.9973 | 0.0492 | 5.9971 | 0.6079 | 0.9268 | 437 |
| trial29 | 0.5099 | 0.1047 | 114.9646 | 10.4737 | 0.3855 | 1.3832 | 0.6889 | 0.9267 | 427 |
| trial34 | 0.5997 | 0.1039 | 120.2969 | 3.1413 | 0.6369 | 1.0811 | 0.6736 | 0.9255 | 1178 |
| trial27 | 0.5061 | 0.1055 | 256.4603 | 5.183 | 2.8751 | 5.3508 | 0.8049 | 0.9254 | 708 |
| trial6 | 0.9065 | 0.1948 | 128.575 | 4.8351 | 0.9786 | 4.8585 | 0.699 | 0.9253 | 356 |
| trial35 | 0.5302 | 0.1475 | 34.4272 | 3.1057 | 0.3783 | 5.9389 | 0.8962 | 0.9251 | 976 |
| trial39 | 0.8587 | 0.109 | 73.075 | 20.642 | 0.0219 | 2.0848 | 0.6047 | 0.9248 | 459 |
| trial17 | 0.8886 | 0.1104 | 141.8841 | 10.2712 | 0.0787 | 5.5841 | 0.6529 | 0.9245 | 441 |
| trial38 | 0.7377 | 0.1005 | 152.2219 | 3.0285 | 0.1002 | 1.6141 | 0.7522 | 0.9243 | 1013 |
| trial30 | 0.6223 | 0.1247 | 72.3965 | 3.3041 | 0.1139 | 5.6578 | 0.7379 | 0.9242 | 981 |
| trial31 | 0.9662 | 0.1057 | 106.3717 | 19.5672 | 0.3194 | 5.9989 | 0.8496 | 0.9242 | 582 |
| trial19 | 0.8002 | 0.1065 | 255.5909 | 14.1506 | 0.0472 | 5.8918 | 0.8434 | 0.9241 | 408 |
| trial37 | 0.9896 | 0.1246 | 166.5842 | 18.89 | 0.0126 | 5.9794 | 0.8777 | 0.9237 | 522 |
| trial11 | 0.6187 | 0.1714 | 176.4134 | 3.0073 | 0.0358 | 5.7436 | 0.7478 | 0.9236 | 715 |
| trial12 | 0.5094 | 0.3754 | 122.7707 | 19.9095 | 0.0608 | 5.9838 | 0.7708 | 0.9231 | 215 |
| trial3 | 0.7484 | 0.3475 | 139.3771 | 5.6049 | 1.3504 | 1.917 | 0.8949 | 0.9229 | 232 |
| trial14 | 0.928 | 0.2047 | 218.9513 | 20.9544 | 0.0678 | 5.9256 | 0.8397 | 0.9225 | 298 |
| trial15 | 0.971 | 0.1959 | 214.0056 | 3.4067 | 0.2804 | 1.1663 | 0.7772 | 0.9221 | 567 |
| trial26 | 0.5693 | 0.1577 | 149.9932 | 20.7862 | 0.0359 | 1.0385 | 0.8587 | 0.9208 | 298 |
| trial4 | 0.5669 | 0.2315 | 239.9287 | 12.956 | 1.193 | 3.0651 | 0.6309 | 0.9207 | 239 |
| trial8 | 0.6143 | 0.1976 | 152.8536 | 8.7521 | 2.6321 | 5.6361 | 0.6999 | 0.9207 | 542 |
| trial20 | 0.5065 | 0.138 | 204.0469 | 12.5213 | 2.971 | 5.748 | 0.8745 | 0.9193 | 585 |
| trial36 | 0.5045 | 0.1217 | 125.4944 | 13.3205 | 2.9554 | 1.3617 | 0.6119 | 0.919 | 561 |
| trial22 | 0.7123 | 0.2928 | 32.2564 | 3.1269 | 2.7891 | 4.6379 | 0.7154 | 0.9171 | 618 |
| trial0 | 0.9712 | 0.3908 | 239.9837 | 7.3847 | 1.8449 | 4.4724 | 0.6883 | 0.9158 | 152 |
| trial21 | 0.8875 | 0.1933 | 95.2283 | 20.9996 | 2.9108 | 1.1088 | 0.7986 | 0.9154 | 413 |
| trial7 | 0.5194 | 0.503 | 52.7982 | 6.2929 | 2.445 | 5.222 | 0.6559 | 0.9152 | 208 |
| trial2 | 0.7217 | 0.2633 | 179.2381 | 7.668 | 2.2636 | 2.0659 | 0.8744 | 0.9145 | 237 |
| trial5 | 0.6014 | 0.559 | 51.8397 | 19.3168 | 1.9832 | 4.4672 | 0.7479 | 0.9111 | 206 |
| trial16 | 0.5701 | 0.6233 | 256.0727 | 3.7648 | 0.1176 | 1.1933 | 0.8054 | 0.9101 | 196 |
| trial13 | 0.6617 | 0.6395 | 100.4746 | 3.2355 | 0.0473 | 5.7955 | 0.8091 | 0.91 | 224 |
| trial10 | 0.8724 | 0.3478 | 255.9409 | 20.8715 | 2.8753 | 1.3192 | 0.7469 | 0.9093 | 204 |
| trial9 | 0.5434 | 0.6934 | 32.8279 | 18.1195 | 1.2451 | 5.5928 | 0.7382 | 0.9073 | 229 |
| trial1 | 0.5144 | 0.641 | 42.4464 | 11.1021 | 0.7396 | 1.0685 | 0.6092 | 0.9027 | 82 |
| trial18 | 0.517 | 0.7632 | 237.857 | 20.8503 | 0.0554 | 1.6942 | 0.6681 | 0.8928 | 113 |

balanced accuracy score = 0.9819
accuracy score = 0.982
number of accepted models = 3 for threshold = 0.9284


5. Code

import os
import numpy as np
import pandas as pd
from bayes_opt import BayesianOptimization
import xgboost as xgb
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import pickle
import time
from pathlib import Path
import sys
sys.path.append(YOUR DIRECTORY)
import confusion_matrix_plot
    
# *******************************************************************
        
def create_save_models_bayes_opt(dtrain, dvalid, error_metric, 
                                 num_random_points, num_iterations, 
                                 results_dir, processor_type,
                                 number_of_classes):
    # xgb single params
    dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                              'objective':'multi:softprob',
                              'eval_metric':error_metric,  # 'mlogloss',
                              'num_class':number_of_classes}
    
    if processor_type == 'cpu':
        tree_method  = 'hist' # auto
        predictor = 'auto'  # 'cpu_predictor' , default = auto
    elif processor_type == 'gpu':
        tree_method  = 'gpu_hist'
        predictor = 'gpu_predictor' # default = auto
    else:
        print('\ninvalid processor_type in create_models():',processor_type)
        raise NameError
        
    dict_single_parameters['tree_method'] = tree_method
    dict_single_parameters['predictor'] = predictor
    
    # save to results dir
    dict_file = open(results_dir + 'dict_single_parameters.pkl','wb')
    pickle.dump(dict_single_parameters, dict_file)
    dict_file.close()
       
    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    list_boosting_rounds = []
    
    start_time_total = time.time()
    
    def xgb_function(eta, max_depth, colsample_bytree, reg_lambda, reg_alpha,
                     subsample, max_bin):
        
        dict_parameters = {'eta':eta, 'max_depth':int(max_depth), 
        'colsample_bytree':colsample_bytree, 'reg_lambda':int(reg_lambda), 
        'reg_alpha':int(reg_alpha), 'subsample':subsample, 'max_bin':int(max_bin)}
        
        dict_parameters.update(dict_single_parameters)        
        
        watchlist = [(dvalid, 'eval')]
        xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                              num_boost_round=maximum_boosting_rounds,
                              early_stopping_rounds=early_stop_rounds,
                              verbose_eval=False)
        
        multiclass_log_loss = xgb_model.best_score
        boosting_rounds = xgb_model.best_ntree_limit
        
        # record boosting_rounds here, added to bayes opt parameter df below
        list_boosting_rounds.append(boosting_rounds)
        
        # bayes opt is a maximization algorithm, to minimize mlogloss, return 1-this
        bayes_opt_score = 1.0 - multiclass_log_loss
        
        return bayes_opt_score
        
    optimizer = BayesianOptimization(f=xgb_function,
                pbounds={'eta': (0.1, 0.8),
                         'max_depth': (3, 21),
                         'colsample_bytree': (0.5, 0.99),
                         'reg_lambda': (1, 6),
                         'reg_alpha': (0, 3),
                         'subsample': (0.6, 0.9),
                         'max_bin': (32, 257)},
                         verbose=2)
    
    optimizer.maximize(init_points=num_random_points, 
                       n_iter=num_iterations)
    
    print('\nbest result:', optimizer.max)
    
    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
    
    # optimizer.res is a list of dicts
    list_dfs = []
    counter = 0
    for result in optimizer.res:
        df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                         columns=['trial' + str(counter)]).T
        df_temp['bayes opt error'] = result['target']
        
        df_temp['boosting rounds'] = list_boosting_rounds[counter]
        
        list_dfs.append(df_temp)
        
        counter = counter + 1
        
    df_results = pd.concat(list_dfs, axis=0)
    df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
    df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')
        
# end of create_save_models_bayes_opt()
    
# *******************************************************************
            
def make_final_predictions(dcalib, dprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, ml_name, dict_single_params):
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if df_params.loc[df_params.index[i],'bayes opt error'] > threshold:
            
            dict_temp = {'eta':df_params.loc[df_params.index[i],'eta'], 
            'max_depth':int(df_params.loc[df_params.index[i],'max_depth']), 
            'colsample_bytree':df_params.loc[df_params.index[i],'colsample_bytree'], 
            'reg_lambda':int(df_params.loc[df_params.index[i],'reg_lambda']), 
            'reg_alpha':int(df_params.loc[df_params.index[i],'reg_alpha']), 
            'subsample':df_params.loc[df_params.index[i],'subsample'], 
            'max_bin':int(df_params.loc[df_params.index[i],'max_bin'])}
            
            dict_temp.update(dict_single_params)       
            
            
            ml_model = xgb.train(params=dict_temp, dtrain=dcalib,
                       num_boost_round=df_params.loc[df_params.index[i],'boosting rounds'],
                       verbose_eval=False)
            
            # these are class probabilities
            list_predicted_prob.append(ml_model.predict(dprod))
            
            accepted_models_num = accepted_models_num + 1
            
            if save_models_flag:
                model_name = ml_name + df_params.index[i] + '.joblib'
                joblib.dump(ml_model, save_directory + model_name)

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)
    
    # compute and save error measures

    # print info to file
    stdout_default = sys.stdout
    sys.stdout = open(save_directory + ml_name + '_prediction_results.txt','w')
    
    print('balanced accuracy score =',balanced_accuracy_score(yprod, y_predicted_class))
    
    print('accuracy score =',accuracy_score(yprod, y_predicted_class))
    
    print('number of accepted models =',accepted_models_num,' for threshold =',threshold)
    
    print('\nclassification report:')
    print(classification_report(yprod, y_predicted_class, digits=3, output_dict=False))
    
    print('\nraw confusion matrix:')
    cm_raw = confusion_matrix(yprod, y_predicted_class)
    print(cm_raw)
    
    print('\nconfusion matrix normalized by prediction:')
    cm_pred = confusion_matrix(yprod, y_predicted_class, normalize='pred')
    print(cm_pred)
    
    print('\nconfusion matrix normalized by true:')
    cm_true = confusion_matrix(yprod, y_predicted_class, normalize='true')
    print(cm_true)
    
    sys.stdout = stdout_default
    
    # plot and save confusion matrices
    figure_size = (12, 8)
    number_of_decimals = 4
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_raw, 
    list_class_names, save_directory, 'Confusion Matrix', 
    ml_name + '_confusion_matrix', False, None, 30, figure_size,
    number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_pred, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Prediction', 
    ml_name + '_confusion_matrix_norm_by_prediction', False, 'pred', 
    30, figure_size, number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_true, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Actual', 
    ml_name + '_confusion_matrix_norm_by_true', False, 'true', 
    30, figure_size, number_of_decimals)

# end of make_final_predictions()
       
# *******************************************************************

if __name__ == '__main__':    
    
    type_of_processor = 'gpu'
    
    ml_algorithm_name = 'xgb'
    file_name_stub = ml_algorithm_name + '_bayes_opt'  
    
    calculation_type = 'production' #'calibration' 'production'
    
    data_directory = YOUR DIRECTORY
    
    base_directory = YOUR DIRECTORY
    
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
               
    # fixed parameters
    error_type = 'mlogloss'  # multi class log loss
    threshold_error = 0.9284  # this is 1 - mlogloss
    total_number_of_iterations = 40
    number_of_random_points = 10  # random searches to start opt process
    # this is # of bayes iters, thus total=this + # of random pts
    number_of_iterations =  total_number_of_iterations - number_of_random_points
    save_models = False
    
       
    # read data
    x_calib = np.load(data_directory + 'x_mnist_calibration.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration.npy')
    d_calib = xgb.DMatrix(x_calib, label=y_calib)
    num_classes = np.unique(y_calib).shape[0]
    
    x_train, x_valid, y_train, y_valid = train_test_split(x_calib, y_calib,
    train_size=0.75, shuffle=True, stratify=y_calib)
    
    # transform numpy arrays into dmatrix format
    d_train = xgb.DMatrix(x_train, label=y_train)
    d_valid = xgb.DMatrix(x_valid, label=y_valid)
                
    print('\n*** starting at',pd.Timestamp.now())

    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
        
        create_save_models_bayes_opt(d_train, d_valid, error_type,
                                     number_of_random_points, number_of_iterations,
                                     results_directory, type_of_processor,
                                     num_classes)
                          
    elif calculation_type == 'production':
        
        # get xgboost parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_bayes_opt_results_parameters.pkl')
        
        dict_file = open(models_dir + 'dict_single_parameters.pkl','rb')
        dictionary_single_parameters = pickle.load(dict_file)
        dict_file.close()
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
    
        # transform numpy arrays into dmatrix format
        d_prod = xgb.DMatrix(x_prod, label=y_prod)
        d_calib = xgb.DMatrix(x_calib, label=y_calib)
                
        make_final_predictions(d_calib, d_prod, y_prod, class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, ml_algorithm_name, 
                               dictionary_single_parameters)
              
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError