1. GPyOpt Python Package
GPyOpt is a Python open-source library for Bayesian Optimization developed by the Machine Learning group of the University of Sheffield. It is based on GPy, a Python framework for Gaussian process modelling.
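To give a feel for the basic workflow, here is a minimal sketch (assuming GPyOpt is installed, for example via pip install gpyopt) that minimizes a toy one-dimensional function; the toy objective and domain are purely illustrative:

```python
import GPyOpt

# toy objective: GPyOpt passes a 2-d array of candidate points
# (one row per point, one column per variable) and expects values back
def toy_objective(x):
    return (x - 2.0)**2

# one continuous variable on [-5, 5]
toy_domain = [{'name':'x', 'type':'continuous', 'domain':(-5.0, 5.0)}]

toy_bo = GPyOpt.methods.BayesianOptimization(f=toy_objective, domain=toy_domain,
                                              acquisition_type='EI')
toy_bo.run_optimization(max_iter=15)

print('best x found:', toy_bo.x_opt)
print('best objective value:', toy_bo.fx_opt)
```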
In this article, we demonstrate how to use this package to perform hyperparameter search for a classification problem with XGBoost.
2. Using GPyOpt
Below are code fragments showing how to integrate the package with XGBoost.
We begin by specifying whether the problem is one of minimization or maximization. Here, we are again using the MNIST handwritten digits data set for our classification problem, so we wish to minimize the multiclass log loss.
```python
if error_metric == 'mlogloss':
    maximize_bool = False
else:
    raise NameError
```
Then we set the search space as a list of dictionaries, one per hyperparameter, in which we label the variable names, set the variable types to continuous or discrete, and specify their ranges of values.
```python
search_space = [
    {'name':'eta', 'type':'continuous', 'domain':(0.4, 0.8)},
    {'name':'max_depth', 'type':'continuous', 'domain':(3, 21)},
    {'name':'colsample_bytree', 'type':'continuous', 'domain':(0.5, 0.99)},
    {'name':'reg_lambda', 'type':'discrete', 'domain':(1, 2, 3, 4, 5, 6)},
    {'name':'reg_alpha', 'type':'discrete', 'domain':(0, 1, 2, 3)},
    {'name':'subsample', 'type':'continuous', 'domain':(0.6, 0.9)},
    {'name':'max_bin', 'type':'discrete', 'domain':(50, 75, 100, 125, 150, 200, 250)}]
```
We also set the fixed parameters required by XGBoost.
```python
dict_single_parameters = {'booster':'gbtree',
                          'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_metric,  # 'mlogloss'
                          'num_class':number_of_classes}

maximum_boosting_rounds = 1500
early_stop_rounds = 10
```
Now we define the function that trains our XGBoost model and returns the loss to be minimized.
```python
def xgb_function(params):
    dict_parameters = {'eta':params[0][0],
                       'max_depth':int(params[0][1]),
                       'colsample_bytree':params[0][2],
                       'reg_lambda':int(params[0][3]),
                       'reg_alpha':int(params[0][4]),
                       'subsample':params[0][5],
                       'max_bin':int(params[0][6])}
    dict_parameters.update(dict_single_parameters)

    watchlist = [(dvalid, 'eval')]

    xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                          num_boost_round=maximum_boosting_rounds,
                          early_stopping_rounds=early_stop_rounds,
                          verbose_eval=False)

    model_name = 'xgb_gpyopt_' + str(np.random.uniform(0,1,))[2:9]
    multiclass_log_loss = xgb_model.best_score
    boosting_rounds = xgb_model.best_ntree_limit

    print('\nmodel name:', model_name)
    print('multiclass_log_loss =', multiclass_log_loss)
    print('boosting_rounds = ', boosting_rounds)

    list_boosting_rounds.append(boosting_rounds)
    list_model_name.append(model_name)

    # save model
    if xgb_model.best_ntree_limit <= maximum_boosting_rounds:
        joblib.dump(xgb_model, results_dir + model_name + '.joblib')

    return multiclass_log_loss
```
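A note on compatibility: in recent XGBoost releases, best_ntree_limit has been deprecated in favor of best_iteration, so with a newer XGBoost version the boosting-round bookkeeping above might instead look like this (a hedged sketch, not part of the original code):

```python
# with newer XGBoost versions, best_iteration is the 0-based index of the
# best round found by early stopping; add 1 to count the number of rounds
boosting_rounds = xgb_model.best_iteration + 1
```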
Here, params is a two-dimensional array supplied by GPyOpt, containing one row of hyperparameter values. Note that the order of its second index must match the order of the variables specified in search_space.
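If you prefer not to hard-code the positional indices, the mapping from column position to variable name can be built from search_space itself; a small illustrative sketch (not part of the original code):

```python
# params has shape (1, number_of_variables); pair its single row with the
# variable names, which appear in the same order as in search_space
parameter_names = [variable['name'] for variable in search_space]
dict_parameters = dict(zip(parameter_names, params[0]))

# cast the hyperparameters that XGBoost expects as integers
for name in ('max_depth', 'reg_lambda', 'reg_alpha', 'max_bin'):
    dict_parameters[name] = int(dict_parameters[name])
```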
Next, we create our Bayesian optimization model.
```python
gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=xgb_function, domain=search_space,
                model_type='GP', initial_design_numdata=num_random_points,
                initial_design_type='random', acquisition_type='EI',
                normalize_Y=True, exact_feval=False,
                acquisition_optimizer_type='lbfgs',
                model_update_interval=1, evaluator_type='sequential',
                batch_size=1, num_cores=os.cpu_count(), verbosity=True,
                verbosity_model=False, maximize=maximize_bool, de_duplication=True)
```
We are using Gaussian processes, with the expected improvement acquisition function and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs) algorithm to optimize the acquisition function. See https://gpyopt.readthedocs.io/en/latest/GPyOpt.methods.html for information about all of the parameters, as well as how to use models other than Gaussian processes.
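For instance, switching to a Gaussian process whose kernel hyperparameters are integrated out with MCMC should only require changing the model and acquisition arguments; a hedged sketch (assuming the GP_MCMC and EI_MCMC options are available in your GPyOpt version):

```python
# same objective and search space, different surrogate model and acquisition
gpyopt_bo_mcmc = GPyOpt.methods.BayesianOptimization(f=xgb_function,
                     domain=search_space,
                     model_type='GP_MCMC',
                     acquisition_type='EI_MCMC',
                     initial_design_numdata=num_random_points,
                     initial_design_type='random',
                     normalize_Y=True, exact_feval=False,
                     maximize=maximize_bool, de_duplication=True)
```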
In the next step, we set up log files and run the optimization by specifying the number of trials.
```python
# text file with parameter info, gp info, best results
rf = results_dir + 'report'
# text file with header: Iteration Y var_1 var_2 etc.
ef = results_dir + 'evaluation'
# text file with gp model info
mf = results_dir + 'models'

gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf,
                           evaluations_file=ef, models_file=mf)
```
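After the run, the evaluations file can be loaded back into a DataFrame for inspection; a small sketch, assuming the columns are whitespace-delimited as in the header described above:

```python
# columns: Iteration, Y, followed by one column per search-space variable
df_evaluations = pd.read_csv(ef, delim_whitespace=True)
print(df_evaluations.head())
```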
Finally, we create a Pandas DataFrame to store the hyperparameter values, loss, boosting rounds, and model name for each trial. Note that integer values are rendered as reals.
```python
header_params = []
for param in search_space:
    header_params.append(param['name'])

df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)
# integer values are rendered as real (1 -> 1.0)
df_results[error_metric] = gpyopt_bo.Y
df_results['boosting rounds'] = list_boosting_rounds
df_results['model name'] = list_model_name
```
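Because every evaluated point is stored as floating point, the values that were passed to XGBoost as integers can be truncated back and the best trial located with a sort; a brief illustrative sketch (not part of the original code):

```python
# truncate to int, matching the int() casts inside xgb_function
for column in ['max_depth', 'reg_lambda', 'reg_alpha', 'max_bin']:
    df_results[column] = df_results[column].astype(int)

# trial with the smallest multiclass log loss
best_trial = df_results.sort_values(by=error_metric).iloc[0]
print(best_trial)
```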
3. Best Result and Ensembling
To obtain an ensemble, we load the saved models, sort by multiclass log loss, and apply a threshold so that only the top 5 models are used. Then we input the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are then averaged and the final predicted class is determined via plurality voting.
To obtain the single best result, we sort the DataFrame described in the previous section and take the hyperparameters that yielded the lowest multiclass log loss.
```python
if type_error == 'mlogloss':
    threshold_multiplier = 1.0
    df_params.sort_values(by=type_error, ascending=True, inplace=True)
else:
    raise NameError

# apply threshold
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    bool1 = df_params.loc[df_params.index[i],type_error] < threshold_multiplier*threshold
    bool2 = accepted_models < accepted_models_num
    if bool1 and bool2:
        model_name = str(df_params.loc[df_params.index[i],'model name'])

        try:
            xgb_model = joblib.load(models_directory + model_name + '.joblib')
        except:
            print('\ncould not read', model_name)
        else:
            # these are class probabilities
            list_predicted_prob.append(xgb_model.predict(dprod))

            # best result
            if i == 0:
                y_predicted_class_best = np.argmax(xgb_model.predict(dprod), axis=1)

                name_result = ml_name + '_best'
                classification_analysis.classification_analysis(
                    yprod, y_predicted_class_best, list_class_names,
                    save_directory, name_result)

            accepted_models = accepted_models + 1

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st occurrence in case of a tie
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
```
4. Results
We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data. The five best trials, sorted by multiclass log loss, are shown below.
| eta | max_depth | colsample_bytree | reg_lambda | reg_alpha | subsample | max_bin | mlogloss | boosting rounds |
|---|---|---|---|---|---|---|---|---|
| 0.4000 | 13.5529 | 0.5000 | 6 | 0 | 0.6000 | 75 | 0.078332 | 270 |
| 0.4000 | 13.0877 | 0.5000 | 6 | 0 | 0.6000 | 75 | 0.078332 | 270 |
| 0.4000 | 11.7554 | 0.9900 | 6 | 0 | 0.6000 | 75 | 0.078637 | 292 |
| 0.4567 | 14.9832 | 0.7403 | 6 | 0 | 0.6996 | 75 | 0.078739 | 191 |
| 0.4448 | 16.4082 | 0.6652 | 6 | 0 | 0.6976 | 75 | 0.079514 | 208 |
Best results:
balanced accuracy score = 0.9759
accuracy score = 0.9761

Ensemble results:
balanced accuracy score = 0.978
accuracy score = 0.9782
5. Code
```python
import os
import numpy as np
import pandas as pd

import GPyOpt
import xgboost as xgb
from sklearn.model_selection import train_test_split

import pickle
import joblib
import time
from pathlib import Path

# classification_analysis is the author's own helper module for saving
# classification metrics (assumed available; not shown in this article)
import classification_analysis

# *******************************************************************

def create_save_models(dtrain, dvalid, error_metric, num_random_points,
                       num_iterations, results_dir, processor_type,
                       number_of_classes):

    dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                              'objective':'multi:softprob',
                              'eval_metric':error_metric,
                              'num_class':number_of_classes}

    if processor_type == 'cpu':
        tree_method = 'hist'
        predictor = 'auto'
    elif processor_type == 'gpu':
        tree_method = 'gpu_hist'
        predictor = 'gpu_predictor'
    else:
        print('\ninvalid processor_type in create_models():', processor_type)
        raise NameError

    dict_single_parameters['tree_method'] = tree_method
    dict_single_parameters['predictor'] = predictor

    # save to results dir
    dict_file = open(results_dir + 'dict_single_parameters.pkl','wb')
    pickle.dump(dict_single_parameters, dict_file)

    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    list_boosting_rounds = []
    list_model_name = []

    # create search space
    search_space = [
        {'name':'eta', 'type':'continuous', 'domain':(0.4, 0.8)},
        {'name':'max_depth', 'type':'continuous', 'domain':(3, 21)},
        {'name':'colsample_bytree', 'type':'continuous', 'domain':(0.5, 0.99)},
        {'name':'reg_lambda', 'type':'discrete', 'domain':(1,2,3,4,5,6)},
        {'name':'reg_alpha', 'type':'discrete', 'domain':(0,1,2,3)},
        {'name':'subsample', 'type':'continuous', 'domain':(0.6, 0.9)},
        {'name':'max_bin', 'type':'discrete', 'domain':(50,75,100,125,150,200,250)}]

    if error_metric == 'mlogloss':
        maximize_bool = False
    else:
        raise NameError

    start_time_total = time.time()

    def xgb_function(params):
        dict_parameters = {'eta':params[0][0],
                           'max_depth':int(params[0][1]),
                           'colsample_bytree':params[0][2],
                           'reg_lambda':int(params[0][3]),
                           'reg_alpha':int(params[0][4]),
                           'subsample':params[0][5],
                           'max_bin':int(params[0][6])}
        dict_parameters.update(dict_single_parameters)

        watchlist = [(dvalid, 'eval')]

        xgb_model = xgb.train(params=dict_parameters, dtrain=dtrain, evals=watchlist,
                              num_boost_round=maximum_boosting_rounds,
                              early_stopping_rounds=early_stop_rounds,
                              verbose_eval=False)

        model_name = 'xgb_gpyopt_' + str(np.random.uniform(0,1,))[2:9]
        multiclass_log_loss = xgb_model.best_score
        boosting_rounds = xgb_model.best_ntree_limit

        print('\nmodel name:', model_name)
        print('multiclass_log_loss =', multiclass_log_loss)
        print('boosting_rounds = ', boosting_rounds)

        list_boosting_rounds.append(boosting_rounds)
        list_model_name.append(model_name)

        # save model
        if xgb_model.best_ntree_limit <= maximum_boosting_rounds:
            joblib.dump(xgb_model, results_dir + model_name + '.joblib')

        return multiclass_log_loss

    gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=xgb_function, domain=search_space,
                    model_type='GP', initial_design_numdata=num_random_points,
                    initial_design_type='random', acquisition_type='EI',
                    normalize_Y=True, exact_feval=False,
                    acquisition_optimizer_type='lbfgs',
                    model_update_interval=1, evaluator_type='sequential',
                    batch_size=1, num_cores=os.cpu_count(), verbosity=False,
                    verbosity_model=False, maximize=maximize_bool, de_duplication=True)

    # text file with parameter info, gp info, best results
    rf = results_dir + 'report'
    # text file with header: Iteration Y var_1 var_2 etc.
    ef = results_dir + 'evaluation'
    # text file with gp model info
    mf = results_dir + 'models'

    gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf,
                               evaluations_file=ef, models_file=mf)

    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =', elapsed_time_total, ' minutes')

    # put results in dfs
    header_params = []
    for param in search_space:
        header_params.append(param['name'])

    df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)
    # integer values are rendered as real (1 -> 1.0)
    df_results[error_metric] = gpyopt_bo.Y
    df_results['boosting rounds'] = list_boosting_rounds
    df_results['model name'] = list_model_name

    df_results.to_pickle(results_dir + 'df_results.pkl')
    df_results.to_csv(results_dir + 'df_results.csv')

# end of create_save_models()

# *******************************************************************

def make_final_predictions(dcalib, dprod, yprod, list_class_names,
                           models_directory, save_directory, save_models_flag,
                           df_params, threshold, ml_name, dict_single_params,
                           type_error, accepted_models_num):

    if type_error == 'mlogloss':
        threshold_multiplier = 1.0
        df_params.sort_values(by=type_error, ascending=True, inplace=True)
    else:
        raise NameError

    # apply threshold
    accepted_models = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        bool1 = df_params.loc[df_params.index[i],type_error] < threshold_multiplier*threshold
        bool2 = accepted_models < accepted_models_num
        if bool1 and bool2:
            model_name = str(df_params.loc[df_params.index[i],'model name'])

            try:
                xgb_model = joblib.load(models_directory + model_name + '.joblib')
            except:
                print('\ncould not read', model_name)
            else:
                # these are class probabilities
                list_predicted_prob.append(xgb_model.predict(dprod))

                # best result
                if i == 0:
                    y_predicted_class_best = np.argmax(xgb_model.predict(dprod), axis=1)

                    name_result = ml_name + '_best'
                    classification_analysis.classification_analysis(
                        yprod, y_predicted_class_best, list_class_names,
                        save_directory, name_result)

                accepted_models = accepted_models + 1

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)

    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)

# end of make_final_predictions()

# *******************************************************************

if __name__ == '__main__':

    type_of_processor = 'gpu'

    ml_algorithm_name = 'xgb'
    file_name_stub = 'gpyopt_' + ml_algorithm_name

    calculation_type = 'calibration'  # 'calibration' 'production'

    base_directory = YOUR DIRECTORY
    data_directory = YOUR DIRECTORY

    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)

    # fixed parameters
    error_type = 'mlogloss'
    threshold_error = 0.09
    total_number_of_iterations = 25
    number_of_random_points = 5  # random searches to start opt process
    # this is # of bayes iters, thus total=this + # of random pts
    number_of_iterations = total_number_of_iterations - number_of_random_points
    save_models = False
    num_models_to_accept = 5

    # read data
    x_calib = np.load(data_directory + 'x_mnist_calibration.npy')
    y_calib = np.load(data_directory + 'y_mnist_calibration.npy')
    d_calib = xgb.DMatrix(x_calib, label=y_calib)
    num_classes = np.unique(y_calib).shape[0]

    x_train, x_valid, y_train, y_valid = train_test_split(x_calib, y_calib,
                                                          train_size=0.75,
                                                          shuffle=True,
                                                          stratify=y_calib)

    # transform numpy arrays into dmatrix format
    d_train = xgb.DMatrix(x_train, label=y_train)
    d_valid = xgb.DMatrix(x_valid, label=y_valid)

    print('\n*** starting at', pd.Timestamp.now())

    if calculation_type == 'calibration':

        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        create_save_models(d_train, d_valid, error_type,
                           number_of_random_points, number_of_iterations,
                           results_directory, type_of_processor, num_classes)

    elif calculation_type == 'production':

        # get xgb parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_results.pkl')

        dict_file = open(models_dir + 'dict_single_parameters.pkl','rb')
        dictionary_single_parameters = pickle.load(dict_file)

        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')

        num_classes = np.unique(y_prod).shape[0]

        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))

        # transform numpy arrays into dmatrix format
        d_prod = xgb.DMatrix(x_prod, label=y_prod)
        d_calib = xgb.DMatrix(x_calib, label=y_calib)

        make_final_predictions(d_calib, d_prod, y_prod, class_names_list,
                               models_dir, results_directory, save_models,
                               df_parameters, threshold_error,
                               ml_algorithm_name, dictionary_single_parameters,
                               error_type, num_models_to_accept)

    else:
        print('\ninvalid calculation type:', calculation_type)
        raise NameError
```