Hyperparameter Search With Optuna: Part 2 – XGBoost Classification and Ensembling

Hyperparameter Search With Optuna: Part 1 – Scikit-learn Classification and Ensembling

1. Introduction

In this article, we use the tree-structured Parzen algorithm via Optuna to find hyperparameters for XGBoost for the the MNIST handwritten digits data set classification problem.

2. Using Optuna With XGBoost

To integrate XGBoost with Optuna, we use the following class.

class Objective(object):
    def __init__(self, dtrain, dvalid, dict_single_params,
                 max_boosting_rounds, early_stop, dir_save):
        self.dtrain = dtrain
        self.dvalid = dvalid
        self.dict_single_params = dict_single_params
        self.maximum_boosting_rounds = max_boosting_rounds
        self.early_stop_rounds = early_stop
        self.dir_save = dir_save

    def __call__(self, trial):
        dictionary_single_params = self.dict_single_params
        
        list_bins = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250]                 
        
        eta = trial.suggest_uniform('eta', 0.1, 0.8)
        max_depth = trial.suggest_int('max_depth', 3, 20)
        colsample_bytree=trial.suggest_discrete_uniform('colsample_bytree',0.5,1.0,0.05)
        reg_lambda = trial.suggest_int('reg_lambda', 1, 6)
        reg_alpha = trial.suggest_int('reg_alpha', 0, 3)
        subsample = trial.suggest_discrete_uniform('subsample', 0.6, 0.9, 0.1)
        max_bin = trial.suggest_categorical('max_bin', list_bins)
        
        dict_variable_params = {'eta':eta, 'max_depth':max_depth, 
        'colsample_bytree':colsample_bytree, 'reg_lambda':reg_lambda,
        'reg_alpha':reg_alpha, 'subsample':subsample, 'max_bin':max_bin}
                      
        dictionary_single_params.update(dict_variable_params)  
        
        watchlist = [(self.dvalid, 'eval')]  
        
        xgb_model = xgb.train(params=dictionary_single_params, 
                    dtrain=self.dtrain, evals=watchlist,
                    num_boost_round=self.maximum_boosting_rounds,
                    early_stopping_rounds=self.early_stop_rounds,
                    verbose_eval=False)
                
        multiclass_log_loss = xgb_model.best_score
        
        # save model
        if xgb_model.best_ntree_limit <= self.maximum_boosting_rounds:
            joblib.dump(xgb_model, self.dir_save + str(trial.number) + '_xgb.joblib')
                        
        return multiclass_log_loss

We are compelled to use a class here instead of simply using a function due to how Optuna has been coded (see comments here).

See remarks in our article Hyperparameter Search With Optuna: Part 1 – Scikit-learn Classification and Ensembling for how different types of parameter ranges are expressed in Optuna.

We decided to save all XGBoost models that converged. We could have also applied a minimum multiclass log loss threshold at this point, but decided to do this later in the production phase as we did not have a good feel for setting the threshold value.

dict_single_params comes from the main code:

error_type = 'mlogloss'
dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_type,
                          'num_class':number_of_classes,
                          'tree_method':'gpu_hist',
                          'predictor':'gpu_predictor'}

Setting the objective to multi:softprob allows us to recover class probabilities from predictions.

In the main code, we have:

optimizer_direction = 'minimize'
number_of_random_points = 25  # random searches to start opt process
maximum_time = 3*60*60  # seconds

objective = Objective(d_train, d_valid, dict_single_parameters,
                      maximum_boosting_rounds, early_stop_rounds,
                      results_directory)

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction=optimizer_direction,
        sampler=TPESampler(n_startup_trials=number_of_random_points))

study.optimize(objective, timeout=maximum_time)

# save results
df_results = study.trials_dataframe()
df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
df_results.to_csv(results_directory + 'df_optuna_results.csv')

One of the most useful aspects of Optuna is the ability to save information about each trial in a DataFrame. See our previous article for information about other Optuna parameters in this code fragment.

3. Results

Below is the DataFrame from the Optuna study. We sorted by the ‘value’ columns (this is the multiclass log loss) and only kept the 25 best results.

Optuna Results DataFrame

number	value	params_colsample_bytree	params_eta	params_max_bin	params_max_depth	params_reg_lambda	params_subsample	system_attrs__number	state
122	0.1579	0.85	0.1473	125	6	6	0.9	122	COMPLETE
121	0.1585	0.85	0.1178	125	6	6	0.9	121	COMPLETE
106	0.1586	0.8	0.1556	250	6	6	0.9	106	COMPLETE
115	0.1589	0.85	0.1756	250	6	6	0.9	115	COMPLETE
99	0.1590	0.85	0.1613	250	5	5	0.9	99	COMPLETE
124	0.1593	0.85	0.1237	125	6	6	0.9	124	COMPLETE
68	0.1593	0.85	0.1705	250	7	5	0.9	68	COMPLETE
116	0.1594	0.85	0.1222	125	6	6	0.9	116	COMPLETE
108	0.1594	0.85	0.1324	250	5	6	0.9	108	COMPLETE
81	0.1596	0.8	0.2273	250	7	6	0.9	81	COMPLETE
123	0.1596	0.85	0.1186	125	6	6	0.9	123	COMPLETE
76	0.1602	0.5	0.2719	250	7	5	0.9	76	COMPLETE
119	0.1603	0.85	0.1387	125	6	6	0.9	119	COMPLETE
117	0.1604	0.85	0.1313	125	4	6	0.9	117	COMPLETE
72	0.1605	0.85	0.1667	250	7	5	0.9	72	COMPLETE
96	0.1606	0.85	0.2263	250	8	6	0.9	96	COMPLETE
101	0.1606	0.85	0.1878	250	5	6	0.9	101	COMPLETE
98	0.1606	0.85	0.1618	250	5	6	0.9	98	COMPLETE
74	0.1608	0.8	0.2244	250	7	5	0.9	74	COMPLETE
71	0.1611	0.85	0.1671	250	7	5	0.9	71	COMPLETE
88	0.1611	0.85	0.2629	250	8	6	0.9	88	COMPLETE
100	0.1613	0.85	0.1904	250	5	6	0.9	100	COMPLETE
111	0.1613	0.85	0.1839	250	6	6	0.9	111	COMPLETE
70	0.1616	0.85	0.1693	250	7	5	0.9	70	COMPLETE
80	0.1618	0.8	0.2123	250	7	6	0.6	80	COMPLETE

To create the final result, we set a minimum loss threshold of 0.162 and only used the 25 best models. Then we averaged the resulting class probabilities and used plurality voting to obtain final class predictions.

def make_final_predictions(xprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, df_params,
                           threshold, ml_name, num_models_accept, 
                           optimization_direction):
    
    dprod = xgb.DMatrix(xprod, label=yprod)
    
    if optimization_direction == 'maximize':
        df_params.sort_values(by='value', ascending=False, inplace=True)
    else:
        df_params.sort_values(by='value', ascending=True, inplace=True)
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models_total = df_params.shape[0]
    for i in range(num_models_total):
        if optimization_direction == 'maximize':
            bool1 = df_params.loc[df_params.index[i],'value'] > threshold
        else:
            bool1 = df_params.loc[df_params.index[i],'value'] < threshold
                    
        bool2 = df_params.loc[df_params.index[i],'state'] == 'COMPLETE'
        bool3 = accepted_models_num < num_models_accept
        if bool1 and bool2 and bool3:
            model_number = str(df_params.loc[df_params.index[i],'number'])
            
            try:
                xgb_model = joblib.load(models_directory + model_number + '_xgb.joblib')
            except:
                print('\ncould not read model number:',model_number)
            else:
                list_predicted_prob.append(xgb_model.predict(dprod))            
                accepted_models_num = accepted_models_num + 1

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses 1st ocurrance in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)

balanced accuracy score = 0.9597
accuracy score = 0.9599

4. Code

Below is the code that has not already been shown in the sections above.

import os
import numpy as np
import pandas as pd
import xgboost as xgb
import optuna
from optuna.samplers import TPESampler
from sklearn.model_selection import train_test_split
import joblib
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from pathlib import Path
import sys
       
# *******************************************************************

if __name__ == '__main__':    
    
    ml_algorithm_name = 'xgboost'
    file_name_stub = 'optuna_' + ml_algorithm_name 
    
    calculation_type = 'production' #'calibration' 'production'
    
    base_directory = YOUR DIRECTORY
    data_directory = base_directory + 'data/'
        
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
                
    # fixed parameters - calibration
    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    error_type = 'mlogloss'
    optimizer_direction = 'minimize'
    number_of_random_points = 25  # random searches to start opt process
    maximum_time = 3*60*60  # seconds
              
    # fixed parameters - production
    threshold_error = 0.162  # multiclass log loss
    number_of_models = 25
            
    print('\n*** starting at',pd.Timestamp.now())
    start_time_total = time.time()

    # calibration 
    if calculation_type == 'calibration':
        
        # use small data set for illustration purposes
        x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
        y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
        
        number_of_classes = np.unique(y_calib).shape[0]
        
        x_train, x_valid, y_train, y_valid = train_test_split(x_calib ,y_calib,
        train_size=0.6, shuffle=True, stratify=y_calib)
        
        # transform numpy arrays into dmatrix format
        d_train = xgb.DMatrix(x_train, label=y_train)
        d_valid = xgb.DMatrix(x_valid, label=y_valid)
        
        # xgb single params
        dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                                  'objective':'multi:softprob',
                                  'eval_metric':error_type,
                                  'num_class':number_of_classes,
                                  'tree_method':'gpu_hist',
                                  'predictor':'gpu_predictor'}
                
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
                
        objective = Objective(d_train, d_valid, dict_single_parameters,
                              maximum_boosting_rounds, early_stop_rounds,
                              results_directory)
    
        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(direction=optimizer_direction,
                sampler=TPESampler(n_startup_trials=number_of_random_points))
        
        study.optimize(objective, timeout=maximum_time)
        
        # save results
        df_results = study.trials_dataframe()
        df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
        df_results.to_csv(results_directory + 'df_optuna_results.csv')
        
        
        elapsed_time_total = (time.time()-start_time_total)/60
        print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
        
    elif calculation_type == 'production':
        
        # get optuna results parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_optuna_results.pkl')
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                
        make_final_predictions(x_prod, y_prod, 
                               class_names_list, 
                               models_dir, results_directory, 
                               df_parameters, 
                               threshold_error, file_name_stub,
                               number_of_models, optimizer_direction)
             
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError