1. Introduction
  2. Using Bayesian Optimization
  3. Ensembling
  4. Results
  5. Code

1. Introduction

In Hyperparameter Search With Bayesian Optimization for Scikit-learn Classification and Ensembling we applied the Bayesian Optimization (BO) package to the Scikit-learn ExtraTreesClassifier algorithm. Here we do the same for XGBoost. Because we are using the native (non-Scikit-learn) version of XGBoost, some modifications to the previous code are required, rather than a straightforward drop-in of algorithm-specific parameters. Once that is done, however, we can access the full power of XGBoost running on GPUs with an efficient hyperparameter search method.

2. Using Bayesian Optimization

Below is a code fragment showing how to integrate the package with XGBoost.

# xgb single params
dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_metric,  # 'mlogloss',
                          'num_class':number_of_classes}

if processor_type == 'cpu':
    tree_method  = 'hist' # auto
    predictor = 'auto'  # 'cpu_predictor' , default = auto
elif processor_type == 'gpu':
    tree_method  = 'gpu_hist'
    predictor = 'gpu_predictor' # default = auto
else:
    print('\ninvalid processor_type in create_models():',processor_type)
    raise NameError
    
dict_single_parameters['tree_method'] = tree_method
dict_single_parameters['predictor'] = predictor

# save to results dir
dict_file = open(results_dir + 'dict_single_parameters.pkl','wb')
pickle.dump(dict_single_parameters, dict_file)
dict_file.close()

maximum_boosting_rounds = 1500
early_stop_rounds = 10
list_boosting_rounds = []

start_time_total = time.time()

def xgb_function(eta, max_depth, colsample_bytree, reg_lambda, reg_alpha,
                 subsample, max_bin):
    
    dict_parameters = {'eta':eta, 'max_depth':int(max_depth), 
    'colsample_bytree':colsample_bytree, 'reg_lambda':int(reg_lambda), 
    'reg_alpha':int(reg_alpha), 'subsample':subsample, 'max_bin':int(max_bin)}
    
    dict_parameters.update(dict_single_parameters)        
    
    watchlist = [(dvalid, 'eval')]
    xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                          num_boost_round=maximum_boosting_rounds,
                          early_stopping_rounds=early_stop_rounds,
                          verbose_eval=False)
    
    multiclass_log_loss = xgb_model.best_score
    boosting_rounds = xgb_model.best_ntree_limit
    
    # record boosting_rounds here, added to bayes opt parameter df below
    list_boosting_rounds.append(boosting_rounds)
    
    # bayes opt is a maximization algorithm, to minimize mlogloss, return 1-this
    bayes_opt_score = 1.0 - multiclass_log_loss
    
    return bayes_opt_score

optimizer = BayesianOptimization(f=xgb_function,
            pbounds={'eta': (0.1, 0.8),
                     'max_depth': (3, 21),
                     'colsample_bytree': (0.5, 0.99),
                     'reg_lambda': (1, 6),
                     'reg_alpha': (0, 3),
                     'subsample': (0.6, 0.9),
                     'max_bin': (32, 257)},
                     verbose=2)

optimizer.maximize(init_points=num_random_points, 
                   n_iter=num_iterations)

print('\nbest result:', optimizer.max)

Calling the XGBoost train function requires various parameters. We must decide which parameters will be exposed to BO, which will be fixed, and, of the fixed ones, which will be needed later when creating a model from all of the calibration data. Thus we create a dictionary of fixed parameters and save it for later use. The variable parameters are placed in a dictionary that is used in the train function call. Note that BO only supports real-valued parameters, so we must cast the integer-valued parameters to integers ourselves. Also note that BO only sees the parameters that appear in the pbounds dictionary.

As this is a multiclass problem, we use the mlogloss (multiclass logarithmic loss) evaluation metric. Since lower is better and BO is a maximization algorithm, we record the BO score as 1 – mlogloss. Additionally, since we want to ensemble by first averaging predicted class probabilities and then obtaining a definite class prediction via plurality voting, we set the XGBoost objective to multi:softprob instead of multi:softmax. When we call predict, we then get predicted class probabilities; with multi:softmax, we would get definite classes instead.
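
To make the distinction concrete, below is a minimal sketch on synthetic data (illustrative values only, not the settings used in this article) showing the shape of predictions under multi:softprob.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 5))
y_demo = rng.integers(0, 3, size=300)
d_demo = xgb.DMatrix(x_demo, label=y_demo)

# multi:softprob yields an (n_samples, n_classes) array of probabilities;
# multi:softmax would instead yield the predicted class labels directly
params_demo = {'objective':'multi:softprob', 'num_class':3, 'eta':0.3}
booster_demo = xgb.train(params=params_demo, dtrain=d_demo, num_boost_round=10)

probabilities = booster_demo.predict(d_demo)
print(probabilities.shape)        # (300, 3)
print(probabilities[0].sum())     # each row sums to 1
print(np.argmax(probabilities, axis=1)[:5])  # definite classes, if needed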

Because we use early stopping, we must record the actual number of boosting rounds used for each set of BO parameters. We do this inside the objective function, which BO calls once per iteration (an implicit loop), accumulating the results in a list. Later, we save this list along with the other BO parameters.

See Hyperparameter Search With Bayesian Optimization for Scikit-learn Classification and Ensembling for an explanation of the other BO parameters.

To collect all results from BO, we have

list_dfs = []
counter = 0
for result in optimizer.res:
    df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                     columns=['trial' + str(counter)]).T
    df_temp['bayes opt error'] = result['target']
    
    df_temp['boosting rounds'] = list_boosting_rounds[counter]
    
    list_dfs.append(df_temp)
    
    counter = counter + 1
    
df_results = pd.concat(list_dfs, axis=0)
df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')

optimizer.res is a list of dictionaries. The code above puts each dictionary into a DataFrame and then stacks them vertically into a single DataFrame. optimizer.max yields the best result (maximum target); it can also be obtained from this DataFrame. Note that we also save the actual number of boosting rounds.
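
As a quick check, here is a minimal sketch (assuming the df_results DataFrame built above) of recovering the best trial from the DataFrame, equivalent to optimizer.max:

# index label of the row with the maximum BO score, e.g. 'trial32'
best_trial = df_results['bayes opt error'].idxmax()
print('best trial:', best_trial)
print(df_results.loc[best_trial])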

3. Ensembling

To obtain an ensemble, we apply a threshold to the BO error (1 – multiclass log loss) and refit XGBoost on the entire calibration data set for each accepted trial. Then we feed in the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are averaged and the final predicted class is determined via plurality voting.

accepted_models_num = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    if df_params.loc[df_params.index[i],'bayes opt error'] > threshold:
        
        dict_temp = {'eta':df_params.loc[df_params.index[i],'eta'], 
        'max_depth':int(df_params.loc[df_params.index[i],'max_depth']), 
        'colsample_bytree':df_params.loc[df_params.index[i],'colsample_bytree'], 
        'reg_lambda':int(df_params.loc[df_params.index[i],'reg_lambda']), 
        'reg_alpha':int(df_params.loc[df_params.index[i],'reg_alpha']), 
        'subsample':df_params.loc[df_params.index[i],'subsample'], 
        'max_bin':int(df_params.loc[df_params.index[i],'max_bin'])}
        
        dict_temp.update(dict_single_params)       
                
        ml_model = xgb.train(params=dict_temp, dtrain=dcalib,
                   num_boost_round=df_params.loc[df_params.index[i],'boosting rounds'],
                   verbose_eval=False)
        
        # these are class probabilities
        list_predicted_prob.append(ml_model.predict(dprod))
        
        accepted_models_num = accepted_models_num + 1
        
        if save_models_flag:
            model_name = ml_name + df_params.index[i] + '.joblib'
            joblib.dump(ml_model, save_directory + model_name)

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st occurrence in case of a tie
y_predicted_class = np.argmax(mean_probabilities, axis=1)

Note that the fixed XGBoost parameters are read from the saved dictionary mentioned in the previous section. Also, we must remember to convert the real values supplied by BO back to integers for the integer-valued parameters.
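
As a small illustration of the averaging and voting steps, here are made-up probabilities for two hypothetical models, three samples, and three classes:

import numpy as np

model_a_prob = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.4, 0.4, 0.2]])
model_b_prob = np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.4, 0.4, 0.2]])

# mean over models, then argmax over classes
mean_prob = np.mean([model_a_prob, model_b_prob], axis=0)
print(np.argmax(mean_prob, axis=1))  # [0 1 0]; the tie in the 3rd row goes to class 0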

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we set the threshold to a value such that only the top 3 models were used in the ensemble.
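
One way to set such a threshold programmatically (a sketch, assuming the df_results DataFrame saved during calibration and no exact ties at the cutoff) is to use the error of the 4th-best trial, so that the strict > comparison in make_final_predictions() keeps exactly the top 3 models:

top_k = 3
sorted_errors = df_results['bayes opt error'].sort_values(ascending=False)
threshold_error = sorted_errors.iloc[top_k]  # error of the 4th-best trial
print('threshold =', threshold_error)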

XGBoost Hyperparameters

| trial | colsample_bytree | eta | max_bin | max_depth | reg_alpha | reg_lambda | subsample | bayes opt error | boosting rounds |
|---|---|---|---|---|---|---|---|---|---|
| trial32 | 0.5036 | 0.1143 | 138.7366 | 4.3728 | 0.1127 | 1.1435 | 0.7542 | 0.9358 | 5 |
| trial25 | 0.6243 | 0.1208 | 165.0408 | 4.1827 | 0.3094 | 1.1087 | 0.756 | 0.9287 | 556 |
| trial28 | 0.5204 | 0.1223 | 219.8433 | 4.4607 | 0.1138 | 5.9987 | 0.6937 | 0.9285 | 694 |
| trial33 | 0.5078 | 0.108 | 222.0052 | 9.6904 | 0.2431 | 5.7352 | 0.6508 | 0.9283 | 471 |
| trial23 | 0.5015 | 0.1057 | 75.8412 | 20.199 | 0.1203 | 5.9917 | 0.7483 | 0.9279 | 634 |
| trial24 | 0.8885 | 0.2177 | 34.6693 | 4.9973 | 0.0492 | 5.9971 | 0.6079 | 0.9268 | 437 |
| trial29 | 0.5099 | 0.1047 | 114.9646 | 10.4737 | 0.3855 | 1.3832 | 0.6889 | 0.9267 | 427 |
| trial34 | 0.5997 | 0.1039 | 120.2969 | 3.1413 | 0.6369 | 1.0811 | 0.6736 | 0.9255 | 1178 |
| trial27 | 0.5061 | 0.1055 | 256.4603 | 5.183 | 2.8751 | 5.3508 | 0.8049 | 0.9254 | 708 |
| trial6 | 0.9065 | 0.1948 | 128.575 | 4.8351 | 0.9786 | 4.8585 | 0.699 | 0.9253 | 356 |
| trial35 | 0.5302 | 0.1475 | 34.4272 | 3.1057 | 0.3783 | 5.9389 | 0.8962 | 0.9251 | 976 |
| trial39 | 0.8587 | 0.109 | 73.075 | 20.642 | 0.0219 | 2.0848 | 0.6047 | 0.9248 | 459 |
| trial17 | 0.8886 | 0.1104 | 141.8841 | 10.2712 | 0.0787 | 5.5841 | 0.6529 | 0.9245 | 441 |
| trial38 | 0.7377 | 0.1005 | 152.2219 | 3.0285 | 0.1002 | 1.6141 | 0.7522 | 0.9243 | 1013 |
| trial30 | 0.6223 | 0.1247 | 72.3965 | 3.3041 | 0.1139 | 5.6578 | 0.7379 | 0.9242 | 981 |
| trial31 | 0.9662 | 0.1057 | 106.3717 | 19.5672 | 0.3194 | 5.9989 | 0.8496 | 0.9242 | 582 |
| trial19 | 0.8002 | 0.1065 | 255.5909 | 14.1506 | 0.0472 | 5.8918 | 0.8434 | 0.9241 | 408 |
| trial37 | 0.9896 | 0.1246 | 166.5842 | 18.89 | 0.0126 | 5.9794 | 0.8777 | 0.9237 | 522 |
| trial11 | 0.6187 | 0.1714 | 176.4134 | 3.0073 | 0.0358 | 5.7436 | 0.7478 | 0.9236 | 715 |
| trial12 | 0.5094 | 0.3754 | 122.7707 | 19.9095 | 0.0608 | 5.9838 | 0.7708 | 0.9231 | 215 |
| trial3 | 0.7484 | 0.3475 | 139.3771 | 5.6049 | 1.3504 | 1.917 | 0.8949 | 0.9229 | 232 |
| trial14 | 0.928 | 0.2047 | 218.9513 | 20.9544 | 0.0678 | 5.9256 | 0.8397 | 0.9225 | 298 |
| trial15 | 0.971 | 0.1959 | 214.0056 | 3.4067 | 0.2804 | 1.1663 | 0.7772 | 0.9221 | 567 |
| trial26 | 0.5693 | 0.1577 | 149.9932 | 20.7862 | 0.0359 | 1.0385 | 0.8587 | 0.9208 | 298 |
| trial4 | 0.5669 | 0.2315 | 239.9287 | 12.956 | 1.193 | 3.0651 | 0.6309 | 0.9207 | 239 |
| trial8 | 0.6143 | 0.1976 | 152.8536 | 8.7521 | 2.6321 | 5.6361 | 0.6999 | 0.9207 | 542 |
| trial20 | 0.5065 | 0.138 | 204.0469 | 12.5213 | 2.971 | 5.748 | 0.8745 | 0.9193 | 585 |
| trial36 | 0.5045 | 0.1217 | 125.4944 | 13.3205 | 2.9554 | 1.3617 | 0.6119 | 0.919 | 561 |
| trial22 | 0.7123 | 0.2928 | 32.2564 | 3.1269 | 2.7891 | 4.6379 | 0.7154 | 0.9171 | 618 |
| trial0 | 0.9712 | 0.3908 | 239.9837 | 7.3847 | 1.8449 | 4.4724 | 0.6883 | 0.9158 | 152 |
| trial21 | 0.8875 | 0.1933 | 95.2283 | 20.9996 | 2.9108 | 1.1088 | 0.7986 | 0.9154 | 413 |
| trial7 | 0.5194 | 0.503 | 52.7982 | 6.2929 | 2.445 | 5.222 | 0.6559 | 0.9152 | 208 |
| trial2 | 0.7217 | 0.2633 | 179.2381 | 7.668 | 2.2636 | 2.0659 | 0.8744 | 0.9145 | 237 |
| trial5 | 0.6014 | 0.559 | 51.8397 | 19.3168 | 1.9832 | 4.4672 | 0.7479 | 0.9111 | 206 |
| trial16 | 0.5701 | 0.6233 | 256.0727 | 3.7648 | 0.1176 | 1.1933 | 0.8054 | 0.9101 | 196 |
| trial13 | 0.6617 | 0.6395 | 100.4746 | 3.2355 | 0.0473 | 5.7955 | 0.8091 | 0.91 | 224 |
| trial10 | 0.8724 | 0.3478 | 255.9409 | 20.8715 | 2.8753 | 1.3192 | 0.7469 | 0.9093 | 204 |
| trial9 | 0.5434 | 0.6934 | 32.8279 | 18.1195 | 1.2451 | 5.5928 | 0.7382 | 0.9073 | 229 |
| trial1 | 0.5144 | 0.641 | 42.4464 | 11.1021 | 0.7396 | 1.0685 | 0.6092 | 0.9027 | 82 |
| trial18 | 0.517 | 0.7632 | 237.857 | 20.8503 | 0.0554 | 1.6942 | 0.6681 | 0.8928 | 113 |

balanced accuracy score = 0.9819
accuracy score = 0.982
number of accepted models = 3 for threshold = 0.9284


5. Code

import os
import numpy as np
import pandas as pd
from bayes_opt import BayesianOptimization
import xgboost as xgb
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import pickle
import time
from pathlib import Path
import sys
sys.path.append(YOUR DIRECTORY)
import confusion_matrix_plot
    
# *******************************************************************
        
def create_save_models_bayes_opt(dtrain, dvalid, error_metric, 
                                 num_random_points, num_iterations, 
                                 results_dir, processor_type,
                                 number_of_classes):
    # xgb single params
    dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                              'objective':'multi:softprob',
                              'eval_metric':error_metric,  # 'mlogloss',
                              'num_class':number_of_classes}
    
    if processor_type == 'cpu':
        tree_method  = 'hist' # auto
        predictor = 'auto'  # 'cpu_predictor' , default = auto
    elif processor_type == 'gpu':
        tree_method  = 'gpu_hist'
        predictor = 'gpu_predictor' # default = auto
    else:
        print('\ninvalid processor_type in create_models():',processor_type)
        raise NameError
        
    dict_single_parameters['tree_method'] = tree_method
    dict_single_parameters['predictor'] = predictor
    
    # save to results dir
    dict_file = open(results_dir + 'dict_single_parameters.pkl','wb')
    pickle.dump(dict_single_parameters, dict_file)
    dict_file.close()
       
    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    list_boosting_rounds = []
    
    start_time_total = time.time()
    
    def xgb_function(eta, max_depth, colsample_bytree, reg_lambda, reg_alpha,
                     subsample, max_bin):
        
        dict_parameters = {'eta':eta, 'max_depth':int(max_depth), 
        'colsample_bytree':colsample_bytree, 'reg_lambda':int(reg_lambda), 
        'reg_alpha':int(reg_alpha), 'subsample':subsample, 'max_bin':int(max_bin)}
        
        dict_parameters.update(dict_single_parameters)        
        
        watchlist = [(dvalid, 'eval')]
        xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                              num_boost_round=maximum_boosting_rounds,
                              early_stopping_rounds=early_stop_rounds,
                              verbose_eval=False)
        
        multiclass_log_loss = xgb_model.best_score
        boosting_rounds = xgb_model.best_ntree_limit
        
        # record boosting_rounds here, added to bayes opt parameter df below
        list_boosting_rounds.append(boosting_rounds)
        
        # bayes opt is a maximization algorithm, to minimize mlogloss, return 1-this
        bayes_opt_score = 1.0 - multiclass_log_loss
        
        return bayes_opt_score
        
    optimizer = BayesianOptimization(f=xgb_function,
                pbounds={'eta': (0.1, 0.8),
                         'max_depth': (3, 21),
                         'colsample_bytree': (0.5, 0.99),
                         'reg_lambda': (1, 6),
                         'reg_alpha': (0, 3),
                         'subsample': (0.6, 0.9),
                         'max_bin': (32, 257)},
                         verbose=2)
    
    optimizer.maximize(init_points=num_random_points, 
                       n_iter=num_iterations)
    
    print('\nbest result:', optimizer.max)
    
    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
    
    # optimizer.res is a list of dicts
    list_dfs = []
    counter = 0
    for result in optimizer.res:
        df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                         columns=['trial' + str(counter)]).T
        df_temp['bayes opt error'] = result['target']
        
        df_temp['boosting rounds'] = list_boosting_rounds[counter]
        
        list_dfs.append(df_temp)
        
        counter = counter + 1
        
    df_results = pd.concat(list_dfs, axis=0)
    df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
    df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')
        
# end of create_save_models_bayes_opt()
    
# *******************************************************************
            
def make_final_predictions(dcalib, dprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, ml_name, dict_single_params):
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if df_params.loc[df_params.index[i],'bayes opt error'] > threshold:
            
            dict_temp = {'eta':df_params.loc[df_params.index[i],'eta'], 
            'max_depth':int(df_params.loc[df_params.index[i],'max_depth']), 
            'colsample_bytree':df_params.loc[df_params.index[i],'colsample_bytree'], 
            'reg_lambda':int(df_params.loc[df_params.index[i],'reg_lambda']), 
            'reg_alpha':int(df_params.loc[df_params.index[i],'reg_alpha']), 
            'subsample':df_params.loc[df_params.index[i],'subsample'], 
            'max_bin':int(df_params.loc[df_params.index[i],'max_bin'])}
            
            dict_temp.update(dict_single_params)       
            
            
            ml_model = xgb.train(params=dict_temp, dtrain=dcalib,
                       num_boost_round=df_params.loc[df_params.index[i],'boosting rounds'],
                       verbose_eval=False)
            
            # these are class probabilities
            list_predicted_prob.append(ml_model.predict(dprod))
            
            accepted_models_num = accepted_models_num + 1
            
            if save_models_flag:
                model_name = ml_name + df_params.index[i] + '.joblib'
                joblib.dump(ml_model, save_directory + model_name)

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)
    
    # compute and save error measures

    # print info to file
    stdout_default = sys.stdout
    sys.stdout = open(save_directory + ml_name + '_prediction_results.txt','w')
    
    print('balanced accuracy score =',balanced_accuracy_score(yprod, y_predicted_class))
    
    print('accuracy score =',accuracy_score(yprod, y_predicted_class))
    
    print('number of accepted models =',accepted_models_num,' for threshold =',threshold)
    
    print('\nclassification report:')
    print(classification_report(yprod, y_predicted_class, digits=3, output_dict=False))
    
    print('\nraw confusion matrix:')
    cm_raw = confusion_matrix(yprod, y_predicted_class)
    print(cm_raw)
    
    print('\nconfusion matrix normalized by prediction:')
    cm_pred = confusion_matrix(yprod, y_predicted_class, normalize='pred')
    print(cm_pred)
    
    print('\nconfusion matrix normalized by true:')
    cm_true = confusion_matrix(yprod, y_predicted_class, normalize='true')
    print(cm_true)
    
    sys.stdout = stdout_default
    
    # plot and save confusion matrices
    figure_size = (12, 8)
    number_of_decimals = 4
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_raw, 
    list_class_names, save_directory, 'Confusion Matrix', 
    ml_name + '_confusion_matrix', False, None, 30, figure_size,
    number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_pred, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Prediction', 
    ml_name + '_confusion_matrix_norm_by_prediction', False, 'pred', 
    30, figure_size, number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_true, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Actual', 
    ml_name + '_confusion_matrix_norm_by_true', False, 'true', 
    30, figure_size, number_of_decimals)

# end of make_final_predictions()
       
# *******************************************************************

if __name__ == '__main__':    
    
    type_of_processor = 'gpu'
    
    ml_algorithm_name = 'xgb'
    file_name_stub = ml_algorithm_name + '_bayes_opt'  
    
    calculation_type = 'production' #'calibration' 'production'
    
    data_directory = YOUR DIRECTORY
    
    base_directory = YOUR DIRECTORY
    
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
               
    # fixed parameters
    error_type = 'mlogloss'  # multi class log loss
    threshold_error = 0.9284  # this is 1 - mlogloss
    total_number_of_iterations = 40
    number_of_random_points = 10  # random searches to start opt process
    # this is # of bayes iters, thus total=this + # of random pts
    number_of_iterations =  total_number_of_iterations - number_of_random_points
    save_models = False
    
       
    # read data
    x_calib = np.load(data_directory + 'x_mnist_calibration.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration.npy')
    d_calib = xgb.DMatrix(x_calib, label=y_calib)
    num_classes = np.unique(y_calib).shape[0]
    
    x_train, x_valid, y_train, y_valid = train_test_split(x_calib, y_calib,
    train_size=0.75, shuffle=True, stratify=y_calib)
    
    # transform numpy arrays into dmatrix format
    d_train = xgb.DMatrix(x_train, label=y_train)
    d_valid = xgb.DMatrix(x_valid, label=y_valid)
                
    print('\n*** starting at',pd.Timestamp.now())

    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
        
        create_save_models_bayes_opt(d_train, d_valid, error_type,
                                     number_of_random_points, number_of_iterations,
                                     results_directory, type_of_processor,
                                     num_classes)
                          
    elif calculation_type == 'production':
        
        # get xgboost parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_bayes_opt_results_parameters.pkl')
        
        dict_file = open(models_dir + 'dict_single_parameters.pkl','rb')
        dictionary_single_parameters = pickle.load(dict_file)
        dict_file.close()
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
    
        # transform numpy arrays into dmatrix format
        d_prod = xgb.DMatrix(x_prod, label=y_prod)
        d_calib = xgb.DMatrix(x_calib, label=y_calib)
                
        make_final_predictions(d_calib, d_prod, y_prod, class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, ml_algorithm_name, 
                               dictionary_single_parameters)
              
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError