1. Introduction
In Hyperparameter Search With Bayesian Optimization for Scikit-learn Classification and Ensembling we applied the Bayesian Optimization (BO) package to the Scikit-learn ExtraTreesClassifier algorithm. Here we do the same for XGBoost. Since we are using the native (non-Scikit-learn) version of XGBoost, some modifications to the previous code are required, rather than a straightforward drop-in of algorithm-specific parameters. Once done, however, we can access the full power of XGBoost running on GPUs with an efficient hyperparameter search method.
2. Using Bayesian Optimization
Below is a code fragment showing how to integrate the package with XGBoost.
# xgb single params
dict_single_parameters = {'booster':'gbtree',
                          'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_metric,  # 'mlogloss'
                          'num_class':number_of_classes}

if processor_type == 'cpu':
    tree_method = 'hist'  # 'auto'
    predictor = 'auto'  # 'cpu_predictor', default = auto
elif processor_type == 'gpu':
    tree_method = 'gpu_hist'
    predictor = 'gpu_predictor'  # default = auto
else:
    print('\ninvalid processor_type in create_models():', processor_type)
    raise NameError

dict_single_parameters['tree_method'] = tree_method
dict_single_parameters['predictor'] = predictor

# save to results dir
dict_file = open(results_dir + 'dict_single_parameters.pkl', 'wb')
pickle.dump(dict_single_parameters, dict_file)
dict_file.close()

maximum_boosting_rounds = 1500
early_stop_rounds = 10
list_boosting_rounds = []

start_time_total = time.time()

def xgb_function(eta, max_depth, colsample_bytree, reg_lambda, reg_alpha,
                 subsample, max_bin):

    dict_parameters = {'eta':eta, 'max_depth':int(max_depth),
                       'colsample_bytree':colsample_bytree,
                       'reg_lambda':int(reg_lambda), 'reg_alpha':int(reg_alpha),
                       'subsample':subsample, 'max_bin':int(max_bin)}
    dict_parameters.update(dict_single_parameters)

    watchlist = [(dvalid, 'eval')]

    xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain,
                          evals=watchlist,
                          num_boost_round=maximum_boosting_rounds,
                          early_stopping_rounds=early_stop_rounds,
                          verbose_eval=False)

    multiclass_log_loss = xgb_model.best_score
    # note: best_ntree_limit is deprecated in newer XGBoost releases;
    # there, use xgb_model.best_iteration + 1 instead
    boosting_rounds = xgb_model.best_ntree_limit

    # record boosting_rounds here, added to bayes opt parameter df below
    list_boosting_rounds.append(boosting_rounds)

    # bayes opt maximizes; to minimize mlogloss, return 1 - mlogloss
    bayes_opt_score = 1.0 - multiclass_log_loss

    return bayes_opt_score

optimizer = BayesianOptimization(f=xgb_function,
                                 pbounds={'eta': (0.1, 0.8),
                                          'max_depth': (3, 21),
                                          'colsample_bytree': (0.5, 0.99),
                                          'reg_lambda': (1, 6),
                                          'reg_alpha': (0, 3),
                                          'subsample': (0.6, 0.9),
                                          'max_bin': (32, 257)},
                                 verbose=2)

optimizer.maximize(init_points=num_random_points, n_iter=num_iterations)

print('\nbest result:', optimizer.max)
Calling the XGBoost train function requires various parameters. We must decide which parameters will be exposed to BO, which will be fixed, and, of the latter, which will be needed later when refitting a model on all of the calibration data. Thus we create a dictionary of fixed parameters and save it for later use. The variable parameters are assembled into their own dictionary, which is used in the train function call. Note that BO only supports real-valued parameters, so we must convert the integer-valued parameters from floats to integers ourselves. Also, BO only sees those parameters that appear in the pbounds dictionary; a minimal sketch of this last point follows.
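As a toy illustration (the function, bounds, and names below are made up and unrelated to the XGBoost search), BO varies only the arguments named in pbounds; any fixed parameters must be merged into the objective from elsewhere, as done above with dict_single_parameters:

# toy sketch: bayes_opt only varies the keys listed in pbounds
from bayes_opt import BayesianOptimization

def toy_function(x, y):
    # maximum at x = 1, y = -2
    return -(x - 1.0)**2 - (y + 2.0)**2

toy_optimizer = BayesianOptimization(f=toy_function,
                                     pbounds={'x': (-3.0, 3.0),
                                              'y': (-5.0, 1.0)},
                                     verbose=0)
toy_optimizer.maximize(init_points=5, n_iter=10)
print(toy_optimizer.max)  # {'target': ..., 'params': {'x': ..., 'y': ...}}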
As this is a multiclass problem, we use the mlogloss (multiclass logarithmic loss) evaluation metric. Since lower is better and BO is a maximization algorithm, we record the BO score as 1 – mlogloss. Additionally, since we want to use ensembling by first taking the mean of predicted probabilities and then obtaining a definite class prediction via plurality voting, we set the XGBoost objective to multi:softprob instead of multi:softmax. When we call predict, we get the predicted class probabilities. If we were to use multi:softmax, we would get definite classes.
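A quick way to see the difference is on synthetic data. This sketch (random data and arbitrary parameters, only the output shapes matter) shows that predict returns an (n_rows, num_class) probability array under multi:softprob but a flat array of hard class labels under multi:softmax:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(100, 5))
y_toy = rng.integers(0, 3, size=100)
d_toy = xgb.DMatrix(x_toy, label=y_toy)

shared = {'num_class': 3, 'eval_metric': 'mlogloss', 'verbosity': 0}
probs = xgb.train({**shared, 'objective': 'multi:softprob'}, d_toy,
                  num_boost_round=5).predict(d_toy)
labels = xgb.train({**shared, 'objective': 'multi:softmax'}, d_toy,
                   num_boost_round=5).predict(d_toy)
print(probs.shape, labels.shape)  # (100, 3) (100,)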
Because of our use of early stopping, we must record the actual number of boosting rounds used for each set of BO parameters. The objective function is called once per BO trial, forming an implicit loop, so we accumulate the round counts in a list defined in the enclosing scope. Later, we will save this list with the other BO parameters.
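The pattern is just a Python closure; a stripped-down sketch (hypothetical names):

# each BO trial calls the objective once, so the outer list gains one
# entry per trial, in call order
list_side_info = []

def objective(x):
    score = -(x - 2.0)**2
    list_side_info.append(score)  # any per-trial quantity can be recorded here
    return score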
See Hyperparameter Search With Bayesian Optimization for Scikit-learn Classification and Ensembling for an explanation of the other BO parameters.
To collect all results from BO, we have
list_dfs = []
counter = 0
for result in optimizer.res:
    df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                     columns=['trial' + str(counter)]).T
    df_temp['bayes opt error'] = result['target']
    df_temp['boosting rounds'] = list_boosting_rounds[counter]
    list_dfs.append(df_temp)
    counter = counter + 1

df_results = pd.concat(list_dfs, axis=0)
df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')
optimizer.res is a list of dictionaries. The code above puts each dictionary into a DataFrame, then stacks them vertically into a single DataFrame. optimizer.max yields the best result (maximum target); it can also be obtained from the combined DataFrame. Note that we save the actual number of boosting rounds.
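For example, the best trial can be recovered from the saved DataFrame like this (using the column name defined above):

# equivalent to optimizer.max, but from the assembled DataFrame
best_trial = df_results['bayes opt error'].idxmax()
print(best_trial)                  # e.g. 'trial32'
print(df_results.loc[best_trial])  # its parameters, score, and boosting rounds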
3. Ensembling
To obtain an ensemble, we apply a threshold to the BO error (1 – multiclass log loss) and refit XGBoost on the entire calibration data set for each model that passes. We then input the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are averaged across models and the final predicted class is determined via plurality voting.
accepted_models_num = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    if df_params.loc[df_params.index[i],'bayes opt error'] > threshold:
        dict_temp = {'eta':df_params.loc[df_params.index[i],'eta'],
                     'max_depth':int(df_params.loc[df_params.index[i],'max_depth']),
                     'colsample_bytree':df_params.loc[df_params.index[i],'colsample_bytree'],
                     'reg_lambda':int(df_params.loc[df_params.index[i],'reg_lambda']),
                     'reg_alpha':int(df_params.loc[df_params.index[i],'reg_alpha']),
                     'subsample':df_params.loc[df_params.index[i],'subsample'],
                     'max_bin':int(df_params.loc[df_params.index[i],'max_bin'])}
        dict_temp.update(dict_single_params)

        # cast to int: the DataFrame may store the round count as a float
        ml_model = xgb.train(params=dict_temp, dtrain=dcalib,
                             num_boost_round=int(df_params.loc[df_params.index[i],'boosting rounds']),
                             verbose_eval=False)

        # these are class probabilities
        list_predicted_prob.append(ml_model.predict(dprod))

        accepted_models_num = accepted_models_num + 1

        if save_models_flag:
            model_name = ml_name + df_params.index[i] + '.joblib'
            joblib.dump(ml_model, save_directory + model_name)

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st occurrence in case of a tie
y_predicted_class = np.argmax(mean_probabilities, axis=1)
Note that the fixed XGBoost parameters are read from the dictionary saved in the previous section. Also, we must remember to convert the real-valued BO results back to integers for the integer-valued parameters.
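To make the final two steps concrete, here is a toy sketch (made-up probabilities for two models and three classes) of the probability averaging and plurality vote; note that np.argmax resolves ties by taking the first (lowest) class index:

import numpy as np

# made-up class probabilities from two models (rows = samples, cols = classes)
model_a = np.array([[0.2, 0.5, 0.3],
                    [0.4, 0.4, 0.2]])
model_b = np.array([[0.1, 0.6, 0.3],
                    [0.4, 0.4, 0.2]])

mean_probabilities = np.mean([model_a, model_b], axis=0)
y_predicted_class = np.argmax(mean_probabilities, axis=1)
print(y_predicted_class)  # [1 0]; row 2 ties at 0.4 for classes 0 and 1 -> 0 wins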
4. Results
We used the MNIST handwritten digits data set. To reduce run time, we set the threshold to a value such that only the top 3 models were used in the ensemble.
XGBoost Hyperparameters
| | colsample_bytree | eta | max_bin | max_depth | reg_alpha | reg_lambda | subsample | bayes opt error | boosting rounds |
---|---|---|---|---|---|---|---|---|---|
trial32 | 0.5036 | 0.1143 | 138.7366 | 4.3728 | 0.1127 | 1.1435 | 0.7542 | 0.93 | 585 |
trial25 | 0.6243 | 0.1208 | 165.0408 | 4.1827 | 0.3094 | 1.1087 | 0.756 | 0.9287 | 556 |
trial28 | 0.5204 | 0.1223 | 219.8433 | 4.4607 | 0.1138 | 5.9987 | 0.6937 | 0.9285 | 694 |
trial33 | 0.5078 | 0.108 | 222.0052 | 9.6904 | 0.2431 | 5.7352 | 0.6508 | 0.9283 | 471 |
trial23 | 0.5015 | 0.1057 | 75.8412 | 20.199 | 0.1203 | 5.9917 | 0.7483 | 0.9279 | 634 |
trial24 | 0.8885 | 0.2177 | 34.6693 | 4.9973 | 0.0492 | 5.9971 | 0.6079 | 0.9268 | 437 |
trial29 | 0.5099 | 0.1047 | 114.9646 | 10.4737 | 0.3855 | 1.3832 | 0.6889 | 0.9267 | 427 |
trial34 | 0.5997 | 0.1039 | 120.2969 | 3.1413 | 0.6369 | 1.0811 | 0.6736 | 0.9255 | 1178 |
trial27 | 0.5061 | 0.1055 | 256.4603 | 5.183 | 2.8751 | 5.3508 | 0.8049 | 0.9254 | 708 |
trial6 | 0.9065 | 0.1948 | 128.575 | 4.8351 | 0.9786 | 4.8585 | 0.699 | 0.9253 | 356 |
trial35 | 0.5302 | 0.1475 | 34.4272 | 3.1057 | 0.3783 | 5.9389 | 0.8962 | 0.9251 | 976 |
trial39 | 0.8587 | 0.109 | 73.075 | 20.642 | 0.0219 | 2.0848 | 0.6047 | 0.9248 | 459 |
trial17 | 0.8886 | 0.1104 | 141.8841 | 10.2712 | 0.0787 | 5.5841 | 0.6529 | 0.9245 | 441 |
trial38 | 0.7377 | 0.1005 | 152.2219 | 3.0285 | 0.1002 | 1.6141 | 0.7522 | 0.9243 | 1013 |
trial30 | 0.6223 | 0.1247 | 72.3965 | 3.3041 | 0.1139 | 5.6578 | 0.7379 | 0.9242 | 981 |
trial31 | 0.9662 | 0.1057 | 106.3717 | 19.5672 | 0.3194 | 5.9989 | 0.8496 | 0.9242 | 582 |
trial19 | 0.8002 | 0.1065 | 255.5909 | 14.1506 | 0.0472 | 5.8918 | 0.8434 | 0.9241 | 408 |
trial37 | 0.9896 | 0.1246 | 166.5842 | 18.89 | 0.0126 | 5.9794 | 0.8777 | 0.9237 | 522 |
trial11 | 0.6187 | 0.1714 | 176.4134 | 3.0073 | 0.0358 | 5.7436 | 0.7478 | 0.9236 | 715 |
trial12 | 0.5094 | 0.3754 | 122.7707 | 19.9095 | 0.0608 | 5.9838 | 0.7708 | 0.9231 | 215 |
trial3 | 0.7484 | 0.3475 | 139.3771 | 5.6049 | 1.3504 | 1.917 | 0.8949 | 0.9229 | 232 |
trial14 | 0.928 | 0.2047 | 218.9513 | 20.9544 | 0.0678 | 5.9256 | 0.8397 | 0.9225 | 298 |
trial15 | 0.971 | 0.1959 | 214.0056 | 3.4067 | 0.2804 | 1.1663 | 0.7772 | 0.9221 | 567 |
trial26 | 0.5693 | 0.1577 | 149.9932 | 20.7862 | 0.0359 | 1.0385 | 0.8587 | 0.9208 | 298 |
trial4 | 0.5669 | 0.2315 | 239.9287 | 12.956 | 1.193 | 3.0651 | 0.6309 | 0.9207 | 239 |
trial8 | 0.6143 | 0.1976 | 152.8536 | 8.7521 | 2.6321 | 5.6361 | 0.6999 | 0.9207 | 542 |
trial20 | 0.5065 | 0.138 | 204.0469 | 12.5213 | 2.971 | 5.748 | 0.8745 | 0.9193 | 585 |
trial36 | 0.5045 | 0.1217 | 125.4944 | 13.3205 | 2.9554 | 1.3617 | 0.6119 | 0.919 | 561 |
trial22 | 0.7123 | 0.2928 | 32.2564 | 3.1269 | 2.7891 | 4.6379 | 0.7154 | 0.9171 | 618 |
trial0 | 0.9712 | 0.3908 | 239.9837 | 7.3847 | 1.8449 | 4.4724 | 0.6883 | 0.9158 | 152 |
trial21 | 0.8875 | 0.1933 | 95.2283 | 20.9996 | 2.9108 | 1.1088 | 0.7986 | 0.9154 | 413 |
trial7 | 0.5194 | 0.503 | 52.7982 | 6.2929 | 2.445 | 5.222 | 0.6559 | 0.9152 | 208 |
trial2 | 0.7217 | 0.2633 | 179.238 | 17.668 | 2.2636 | 2.0659 | 0.8744 | 0.9145 | 237 |
trial5 | 0.6014 | 0.559 | 51.8397 | 19.3168 | 1.9832 | 4.4672 | 0.7479 | 0.9111 | 206 |
trial16 | 0.5701 | 0.6233 | 256.0727 | 3.7648 | 0.1176 | 1.1933 | 0.8054 | 0.9101 | 196 |
trial13 | 0.6617 | 0.6395 | 100.4746 | 3.2355 | 0.0473 | 5.7955 | 0.8091 | 0.91 | 224 |
trial10 | 0.8724 | 0.3478 | 255.9409 | 20.8715 | 2.8753 | 1.3192 | 0.7469 | 0.9093 | 204 |
trial9 | 0.5434 | 0.6934 | 32.8279 | 18.1195 | 1.2451 | 5.5928 | 0.7382 | 0.9073 | 229 |
trial1 | 0.5144 | 0.641 | 42.4464 | 11.1021 | 0.7396 | 1.0685 | 0.6092 | 0.9027 | 82 |
trial18 | 0.517 | 0.7632 | 237.857 | 20.8503 | 0.0554 | 1.6942 | 0.6681 | 0.8928 | 113 |
balanced accuracy score = 0.9819
accuracy score = 0.982
number of accepted models = 3 for threshold = 0.9284
5. Code
import os
import numpy as np
import pandas as pd
from bayes_opt import BayesianOptimization
import xgboost as xgb
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import pickle
import time
from pathlib import Path
import sys
sys.path.append(YOUR DIRECTORY)
import confusion_matrix_plot

# *******************************************************************

def create_save_models_bayes_opt(dtrain, dvalid, error_metric,
                                 num_random_points, num_iterations,
                                 results_dir, processor_type,
                                 number_of_classes):

    # xgb single params
    dict_single_parameters = {'booster':'gbtree',
                              'verbosity':0,
                              'objective':'multi:softprob',
                              'eval_metric':error_metric,  # 'mlogloss'
                              'num_class':number_of_classes}

    if processor_type == 'cpu':
        tree_method = 'hist'  # 'auto'
        predictor = 'auto'  # 'cpu_predictor', default = auto
    elif processor_type == 'gpu':
        tree_method = 'gpu_hist'
        predictor = 'gpu_predictor'  # default = auto
    else:
        print('\ninvalid processor_type in create_models():', processor_type)
        raise NameError

    dict_single_parameters['tree_method'] = tree_method
    dict_single_parameters['predictor'] = predictor

    # save to results dir
    dict_file = open(results_dir + 'dict_single_parameters.pkl', 'wb')
    pickle.dump(dict_single_parameters, dict_file)
    dict_file.close()

    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    list_boosting_rounds = []

    start_time_total = time.time()

    def xgb_function(eta, max_depth, colsample_bytree, reg_lambda,
                     reg_alpha, subsample, max_bin):

        dict_parameters = {'eta':eta, 'max_depth':int(max_depth),
                           'colsample_bytree':colsample_bytree,
                           'reg_lambda':int(reg_lambda),
                           'reg_alpha':int(reg_alpha),
                           'subsample':subsample, 'max_bin':int(max_bin)}
        dict_parameters.update(dict_single_parameters)

        watchlist = [(dvalid, 'eval')]

        xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain,
                              evals=watchlist,
                              num_boost_round=maximum_boosting_rounds,
                              early_stopping_rounds=early_stop_rounds,
                              verbose_eval=False)

        multiclass_log_loss = xgb_model.best_score
        # note: best_ntree_limit is deprecated in newer XGBoost releases;
        # there, use xgb_model.best_iteration + 1 instead
        boosting_rounds = xgb_model.best_ntree_limit

        # record boosting_rounds here, added to bayes opt parameter df below
        list_boosting_rounds.append(boosting_rounds)

        # bayes opt maximizes; to minimize mlogloss, return 1 - mlogloss
        bayes_opt_score = 1.0 - multiclass_log_loss

        return bayes_opt_score

    optimizer = BayesianOptimization(f=xgb_function,
                                     pbounds={'eta': (0.1, 0.8),
                                              'max_depth': (3, 21),
                                              'colsample_bytree': (0.5, 0.99),
                                              'reg_lambda': (1, 6),
                                              'reg_alpha': (0, 3),
                                              'subsample': (0.6, 0.9),
                                              'max_bin': (32, 257)},
                                     verbose=2)

    optimizer.maximize(init_points=num_random_points, n_iter=num_iterations)

    print('\nbest result:', optimizer.max)

    elapsed_time_total = (time.time() - start_time_total)/60
    print('\n\ntotal elapsed time =', elapsed_time_total, ' minutes')

    # optimizer.res is a list of dicts
    list_dfs = []
    counter = 0
    for result in optimizer.res:
        df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                         columns=['trial' + str(counter)]).T
        df_temp['bayes opt error'] = result['target']
        df_temp['boosting rounds'] = list_boosting_rounds[counter]
        list_dfs.append(df_temp)
        counter = counter + 1

    df_results = pd.concat(list_dfs, axis=0)
    df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
    df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')

# end of create_save_models_bayes_opt()

# *******************************************************************

def make_final_predictions(dcalib, dprod, yprod,
                           list_class_names, models_directory,
                           save_directory, save_models_flag,
                           df_params, threshold, ml_name,
                           dict_single_params):

    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if df_params.loc[df_params.index[i],'bayes opt error'] > threshold:
            dict_temp = {'eta':df_params.loc[df_params.index[i],'eta'],
                         'max_depth':int(df_params.loc[df_params.index[i],'max_depth']),
                         'colsample_bytree':df_params.loc[df_params.index[i],'colsample_bytree'],
                         'reg_lambda':int(df_params.loc[df_params.index[i],'reg_lambda']),
                         'reg_alpha':int(df_params.loc[df_params.index[i],'reg_alpha']),
                         'subsample':df_params.loc[df_params.index[i],'subsample'],
                         'max_bin':int(df_params.loc[df_params.index[i],'max_bin'])}
            dict_temp.update(dict_single_params)

            # cast to int: the DataFrame may store the round count as a float
            ml_model = xgb.train(params=dict_temp, dtrain=dcalib,
                                 num_boost_round=int(df_params.loc[df_params.index[i],'boosting rounds']),
                                 verbose_eval=False)

            # these are class probabilities
            list_predicted_prob.append(ml_model.predict(dprod))

            accepted_models_num = accepted_models_num + 1

            if save_models_flag:
                model_name = ml_name + df_params.index[i] + '.joblib'
                joblib.dump(ml_model, save_directory + model_name)

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)

    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)

    # compute and save error measures
    # print info to file
    stdout_default = sys.stdout
    sys.stdout = open(save_directory + ml_name + '_prediction_results.txt','w')

    print('balanced accuracy score =', balanced_accuracy_score(yprod, y_predicted_class))
    print('accuracy score =', accuracy_score(yprod, y_predicted_class))
    print('number of accepted models =', accepted_models_num, ' for threshold =', threshold)

    print('\nclassification report:')
    print(classification_report(yprod, y_predicted_class, digits=3, output_dict=False))

    print('\nraw confusion matrix:')
    cm_raw = confusion_matrix(yprod, y_predicted_class)
    print(cm_raw)

    print('\nconfusion matrix normalized by prediction:')
    cm_pred = confusion_matrix(yprod, y_predicted_class, normalize='pred')
    print(cm_pred)

    print('\nconfusion matrix normalized by true:')
    cm_true = confusion_matrix(yprod, y_predicted_class, normalize='true')
    print(cm_true)

    sys.stdout = stdout_default

    # plot and save confusion matrices
    figure_size = (12, 8)
    number_of_decimals = 4
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_raw,
        list_class_names, save_directory, 'Confusion Matrix',
        ml_name + '_confusion_matrix', False, None, 30,
        figure_size, number_of_decimals)

    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_pred,
        list_class_names, save_directory,
        'Confusion Matrix Normalized by Prediction',
        ml_name + '_confusion_matrix_norm_by_prediction', False, 'pred', 30,
        figure_size, number_of_decimals)

    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_true,
        list_class_names, save_directory,
        'Confusion Matrix Normalized by Actual',
        ml_name + '_confusion_matrix_norm_by_true', False, 'true', 30,
        figure_size, number_of_decimals)

# end of make_final_predictions()

# *******************************************************************

if __name__ == '__main__':

    type_of_processor = 'gpu'

    ml_algorithm_name = 'xgb'
    file_name_stub = ml_algorithm_name + '_bayes_opt'

    calculation_type = 'production'  # 'calibration' 'production'

    data_directory = YOUR DIRECTORY
    base_directory = YOUR DIRECTORY

    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)

    # fixed parameters
    error_type = 'mlogloss'  # multiclass log loss
    threshold_error = 0.9284  # this is 1 - mlogloss
    total_number_of_iterations = 40
    number_of_random_points = 10  # random searches to start opt process
    # this is the # of bayes opt iterations, thus total = this + # of random points
    number_of_iterations = total_number_of_iterations - number_of_random_points
    save_models = False

    # read data
    x_calib = np.load(data_directory + 'x_mnist_calibration.npy')
    y_calib = np.load(data_directory + 'y_mnist_calibration.npy')
    d_calib = xgb.DMatrix(x_calib, label=y_calib)
    num_classes = np.unique(y_calib).shape[0]

    x_train, x_valid, y_train, y_valid = train_test_split(x_calib, y_calib,
                                                          train_size=0.75,
                                                          shuffle=True,
                                                          stratify=y_calib)

    # transform numpy arrays into dmatrix format
    d_train = xgb.DMatrix(x_train, label=y_train)
    d_valid = xgb.DMatrix(x_valid, label=y_valid)

    print('\n*** starting at', pd.Timestamp.now())

    if calculation_type == 'calibration':
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        create_save_models_bayes_opt(d_train, d_valid, error_type,
                                     number_of_random_points,
                                     number_of_iterations,
                                     results_directory,
                                     type_of_processor, num_classes)

    elif calculation_type == 'production':

        # get xgboost parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_bayes_opt_results_parameters.pkl')
        dict_file = open(models_dir + 'dict_single_parameters.pkl','rb')
        dictionary_single_parameters = pickle.load(dict_file)
        dict_file.close()

        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        num_classes = np.unique(y_prod).shape[0]

        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))

        # transform numpy arrays into dmatrix format
        d_prod = xgb.DMatrix(x_prod, label=y_prod)
        d_calib = xgb.DMatrix(x_calib, label=y_calib)

        make_final_predictions(d_calib, d_prod, y_prod, class_names_list,
                               models_dir, results_directory, save_models,
                               df_parameters, threshold_error,
                               ml_algorithm_name,
                               dictionary_single_parameters)

    else:
        print('\ninvalid calculation type:', calculation_type)
        raise NameError