Hyperparameter Search With Optuna: Part 2 – XGBoost Classification and Ensembling

  1. Introduction
  2. Using Optuna With XGBoost
  3. Results
  4. Code

1. Introduction

In this article, we use the tree-structured Parzen estimator (TPE) algorithm, via Optuna, to find hyperparameters for XGBoost applied to the MNIST handwritten digits classification problem.

2. Using Optuna With XGBoost

To integrate XGBoost with Optuna, we use the following class.

class Objective(object):
    def __init__(self, dtrain, dvalid, dict_single_params,
                 max_boosting_rounds, early_stop, dir_save):
        self.dtrain = dtrain
        self.dvalid = dvalid
        self.dict_single_params = dict_single_params
        self.maximum_boosting_rounds = max_boosting_rounds
        self.early_stop_rounds = early_stop
        self.dir_save = dir_save

    def __call__(self, trial):
        # copy so that the shared fixed-parameter dict is not mutated
        # across trials
        dict_params = self.dict_single_params.copy()

        list_bins = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250]

        # hyperparameters searched by Optuna
        eta = trial.suggest_uniform('eta', 0.1, 0.8)
        max_depth = trial.suggest_int('max_depth', 3, 20)
        colsample_bytree = trial.suggest_discrete_uniform('colsample_bytree',
                                                          0.5, 1.0, 0.05)
        reg_lambda = trial.suggest_int('reg_lambda', 1, 6)
        reg_alpha = trial.suggest_int('reg_alpha', 0, 3)
        subsample = trial.suggest_discrete_uniform('subsample', 0.6, 0.9, 0.1)
        max_bin = trial.suggest_categorical('max_bin', list_bins)

        dict_variable_params = {'eta':eta, 'max_depth':max_depth,
                                'colsample_bytree':colsample_bytree,
                                'reg_lambda':reg_lambda,
                                'reg_alpha':reg_alpha,
                                'subsample':subsample,
                                'max_bin':max_bin}

        dict_params.update(dict_variable_params)

        watchlist = [(self.dvalid, 'eval')]

        xgb_model = xgb.train(params=dict_params,
                              dtrain=self.dtrain, evals=watchlist,
                              num_boost_round=self.maximum_boosting_rounds,
                              early_stopping_rounds=self.early_stop_rounds,
                              verbose_eval=False)

        multiclass_log_loss = xgb_model.best_score

        # save only models that converged, i.e., for which early stopping
        # found a best round before the boosting budget was exhausted
        if xgb_model.best_ntree_limit < self.maximum_boosting_rounds:
            joblib.dump(xgb_model, self.dir_save + str(trial.number) + '_xgb.joblib')

        return multiclass_log_loss

We use a class here, rather than a plain function, because Optuna calls the objective with only a trial argument; the constructor gives us a place to bind the training data, fixed parameters, and save directory.
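Note that the class is just a convenient way to bind these extra arguments; a closure achieves the same thing. Below is a minimal sketch, assuming the variables from the main code further down and a hypothetical make_objective helper:

# a minimal sketch of binding extra arguments with a closure instead
# of a class; make_objective is hypothetical, not part of the article's code
def make_objective(dtrain, dvalid, fixed_params):
    def objective(trial):
        params = fixed_params.copy()
        params['eta'] = trial.suggest_uniform('eta', 0.1, 0.8)
        booster = xgb.train(params=params, dtrain=dtrain,
                            evals=[(dvalid, 'eval')],
                            num_boost_round=100,
                            early_stopping_rounds=10,
                            verbose_eval=False)
        return booster.best_score
    return objective

# study.optimize(make_objective(d_train, d_valid, dict_single_parameters),
#                timeout=maximum_time)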

See remarks in our article Hyperparameter Search With Optuna: Part 1 – Scikit-learn Classification and Ensembling for how different types of parameter ranges are expressed in Optuna.
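In brief, each suggest_* call declares one search dimension. The sketch below uses illustrative names and ranges only, with the Optuna API current at the time of writing (newer Optuna versions fold the uniform variants into suggest_float):

# illustrative names and ranges only: the main Optuna range types
x1 = trial.suggest_uniform('x1', 0.0, 1.0)                 # continuous
x2 = trial.suggest_discrete_uniform('x2', 0.5, 1.0, 0.05)  # continuous, on a grid
x3 = trial.suggest_loguniform('x3', 1e-5, 1e-1)            # continuous, log scale
x4 = trial.suggest_int('x4', 1, 10)                        # integer
x5 = trial.suggest_categorical('x5', ['a', 'b', 'c'])      # unordered choices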

We decided to save all XGBoost models that converged (those for which early stopping was triggered). We could also have applied a maximum multiclass log loss threshold at this point, but deferred that to the production phase, as we did not yet have a good feel for setting the threshold value.
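Had we filtered at save time instead, the saving step in __call__ could have looked like this sketch, where self.loss_threshold is a hypothetical extra constructor argument:

# sketch only: apply a loss threshold before saving;
# self.loss_threshold is hypothetical and would be set in __init__
if (xgb_model.best_ntree_limit < self.maximum_boosting_rounds
        and multiclass_log_loss < self.loss_threshold):
    joblib.dump(xgb_model,
                self.dir_save + str(trial.number) + '_xgb.joblib')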

dict_single_params holds the XGBoost parameters that are not searched; it comes from the main code:

error_type = 'mlogloss'
dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                          'objective':'multi:softprob',
                          'eval_metric':error_type,
                          'num_class':number_of_classes,
                          'tree_method':'gpu_hist',
                          'predictor':'gpu_predictor'}

Setting the objective to multi:softprob allows us to recover class probabilities from predictions.
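Concretely, for a booster trained with multi:softprob, predict returns an array of shape (n_samples, num_class). A sketch, where d_some_data is a hypothetical DMatrix:

# sketch: recovering class probabilities and hard predictions
probabilities = xgb_model.predict(d_some_data)        # shape (n_samples, num_class)
predicted_classes = np.argmax(probabilities, axis=1)  # hard class labels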

In the main code, we have:

optimizer_direction = 'minimize'
number_of_random_points = 25  # random searches to start opt process
maximum_time = 3*60*60  # seconds

objective = Objective(d_train, d_valid, dict_single_parameters,
                      maximum_boosting_rounds, early_stop_rounds,
                      results_directory)

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction=optimizer_direction,
        sampler=TPESampler(n_startup_trials=number_of_random_points))

study.optimize(objective, timeout=maximum_time)

# save results
df_results = study.trials_dataframe()
df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
df_results.to_csv(results_directory + 'df_optuna_results.csv')

One of the most useful aspects of Optuna is the ability to save information about each trial in a DataFrame. See our previous article for information about other Optuna parameters in this code fragment.
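The study object also exposes the winning trial directly through standard Optuna attributes:

# standard Optuna accessors for the winning trial
print('best multiclass log loss:', study.best_value)
print('best hyperparameters:', study.best_params)
print('best trial number:', study.best_trial.number)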

3. Results

Below is the DataFrame from the Optuna study. We sorted by the 'value' column (this is the multiclass log loss) and kept only the 25 best results.
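A view like the one below can be produced from the saved DataFrame with, for example:

# sketch: sort trials by multiclass log loss, keep the 25 best
df_best = df_results.sort_values(by='value', ascending=True).head(25)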

Optuna Results DataFrame

| number | value | params_colsample_bytree | params_eta | params_max_bin | params_max_depth | params_reg_alpha | params_reg_lambda | params_subsample | system_attrs__number | state |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 122 | 0.1579 | 0.85 | 0.1473 | 125 | 6 | 0 | 6 | 0.9 | 122 | COMPLETE |
| 121 | 0.1585 | 0.85 | 0.1178 | 125 | 6 | 0 | 6 | 0.9 | 121 | COMPLETE |
| 106 | 0.1586 | 0.8 | 0.1556 | 250 | 6 | 0 | 6 | 0.9 | 106 | COMPLETE |
| 115 | 0.1589 | 0.85 | 0.1756 | 250 | 6 | 0 | 6 | 0.9 | 115 | COMPLETE |
| 99 | 0.1590 | 0.85 | 0.1613 | 250 | 5 | 0 | 5 | 0.9 | 99 | COMPLETE |
| 124 | 0.1593 | 0.85 | 0.1237 | 125 | 6 | 0 | 6 | 0.9 | 124 | COMPLETE |
| 68 | 0.1593 | 0.85 | 0.1705 | 250 | 7 | 0 | 5 | 0.9 | 68 | COMPLETE |
| 116 | 0.1594 | 0.85 | 0.1222 | 125 | 6 | 0 | 6 | 0.9 | 116 | COMPLETE |
| 108 | 0.1594 | 0.85 | 0.1324 | 250 | 5 | 0 | 6 | 0.9 | 108 | COMPLETE |
| 81 | 0.1596 | 0.8 | 0.2273 | 250 | 7 | 0 | 6 | 0.9 | 81 | COMPLETE |
| 123 | 0.1596 | 0.85 | 0.1186 | 125 | 6 | 0 | 6 | 0.9 | 123 | COMPLETE |
| 76 | 0.1602 | 0.5 | 0.2719 | 250 | 7 | 0 | 5 | 0.9 | 76 | COMPLETE |
| 119 | 0.1603 | 0.85 | 0.1387 | 125 | 6 | 0 | 6 | 0.9 | 119 | COMPLETE |
| 117 | 0.1604 | 0.85 | 0.1313 | 125 | 4 | 0 | 6 | 0.9 | 117 | COMPLETE |
| 72 | 0.1605 | 0.85 | 0.1667 | 250 | 7 | 0 | 5 | 0.9 | 72 | COMPLETE |
| 96 | 0.1606 | 0.85 | 0.2263 | 250 | 8 | 0 | 6 | 0.9 | 96 | COMPLETE |
| 101 | 0.1606 | 0.85 | 0.1878 | 250 | 5 | 0 | 6 | 0.9 | 101 | COMPLETE |
| 98 | 0.1606 | 0.85 | 0.1618 | 250 | 5 | 0 | 6 | 0.9 | 98 | COMPLETE |
| 74 | 0.1608 | 0.8 | 0.2244 | 250 | 7 | 0 | 5 | 0.9 | 74 | COMPLETE |
| 71 | 0.1611 | 0.85 | 0.1671 | 250 | 7 | 0 | 5 | 0.9 | 71 | COMPLETE |
| 88 | 0.1611 | 0.85 | 0.2629 | 250 | 8 | 0 | 6 | 0.9 | 88 | COMPLETE |
| 100 | 0.1613 | 0.85 | 0.1904 | 250 | 5 | 0 | 6 | 0.9 | 100 | COMPLETE |
| 111 | 0.1613 | 0.85 | 0.1839 | 250 | 6 | 0 | 6 | 0.9 | 111 | COMPLETE |
| 70 | 0.1616 | 0.85 | 0.1693 | 250 | 7 | 0 | 5 | 0.9 | 70 | COMPLETE |
| 80 | 0.1618 | 0.8 | 0.2123 | 250 | 7 | 0 | 6 | 0.6 | 80 | COMPLETE |

To create the final result, we applied a maximum loss threshold of 0.162 and used at most the 25 best models. We then averaged the class probabilities across the accepted models and took the class with the highest mean probability (the argmax, a form of soft voting) as the final prediction.
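As a toy illustration of this averaging step (the numbers below are made up solely for illustration):

import numpy as np

# toy illustration: class probabilities from two models for three samples
probs_model_1 = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.6, 0.3],
                          [0.3, 0.3, 0.4]])
probs_model_2 = np.array([[0.5, 0.3, 0.2],
                          [0.2, 0.2, 0.6],
                          [0.1, 0.8, 0.1]])

mean_probabilities = np.mean([probs_model_1, probs_model_2], axis=0)
y_predicted_class = np.argmax(mean_probabilities, axis=1)
print(y_predicted_class)  # [0 2 1]

The full production function is below.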

def make_final_predictions(xprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, df_params,
                           threshold, ml_name, num_models_accept, 
                           optimization_direction):
    
    dprod = xgb.DMatrix(xprod, label=yprod)
    
    # sort trials from best to worst
    if optimization_direction == 'maximize':
        df_params.sort_values(by='value', ascending=False, inplace=True)
    else:
        df_params.sort_values(by='value', ascending=True, inplace=True)
    
    # apply threshold and accept at most num_models_accept models
    accepted_models_num = 0
    list_predicted_prob = []
    num_models_total = df_params.shape[0]
    for i in range(num_models_total):
        if optimization_direction == 'maximize':
            bool1 = df_params.loc[df_params.index[i],'value'] > threshold
        else:
            bool1 = df_params.loc[df_params.index[i],'value'] < threshold
                    
        bool2 = df_params.loc[df_params.index[i],'state'] == 'COMPLETE'
        bool3 = accepted_models_num < num_models_accept
        if bool1 and bool2 and bool3:
            model_number = str(df_params.loc[df_params.index[i],'number'])
            
            try:
                xgb_model = joblib.load(models_directory + model_number + '_xgb.joblib')
            except Exception:
                # model was never saved (early stopping did not trigger)
                print('\ncould not read model number:', model_number)
            else:
                list_predicted_prob.append(xgb_model.predict(dprod))            
                accepted_models_num += 1

    # compute mean class probabilities across the accepted models
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # np.argmax returns the first occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)

    # report metrics (these produce the scores shown below);
    # list_class_names, save_directory, and ml_name are used by
    # additional reporting code not shown here
    print('\nbalanced accuracy score =',
          balanced_accuracy_score(yprod, y_predicted_class))
    print('accuracy score =', accuracy_score(yprod, y_predicted_class))

balanced accuracy score = 0.9597
accuracy score = 0.9599


4. Code

Below is the code that has not already been shown in the sections above.

import os
import numpy as np
import pandas as pd
import xgboost as xgb
import optuna
from optuna.samplers import TPESampler
from sklearn.model_selection import train_test_split
import joblib
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from pathlib import Path
import sys
       
# *******************************************************************

if __name__ == '__main__':    
    
    ml_algorithm_name = 'xgboost'
    file_name_stub = 'optuna_' + ml_algorithm_name 
    
    calculation_type = 'production' #'calibration' 'production'
    
    base_directory = 'YOUR DIRECTORY'  # set to your base path, ending with '/'
    data_directory = base_directory + 'data/'
        
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
                
    # fixed parameters - calibration
    maximum_boosting_rounds = 1500
    early_stop_rounds = 10
    error_type = 'mlogloss'
    optimizer_direction = 'minimize'
    number_of_random_points = 25  # random searches to start opt process
    maximum_time = 3*60*60  # seconds
              
    # fixed parameters - production
    threshold_error = 0.162  # multiclass log loss
    number_of_models = 25
            
    print('\n*** starting at',pd.Timestamp.now())
    start_time_total = time.time()

    # calibration 
    if calculation_type == 'calibration':
        
        # use small data set for illustration purposes
        x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
        y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
        
        number_of_classes = np.unique(y_calib).shape[0]
        
        x_train, x_valid, y_train, y_valid = train_test_split(x_calib, y_calib,
                                             train_size=0.6, shuffle=True,
                                             stratify=y_calib)
        
        # transform numpy arrays into dmatrix format
        d_train = xgb.DMatrix(x_train, label=y_train)
        d_valid = xgb.DMatrix(x_valid, label=y_valid)
        
        # xgb single params
        dict_single_parameters = {'booster':'gbtree', 'verbosity':0,
                                  'objective':'multi:softprob',
                                  'eval_metric':error_type,
                                  'num_class':number_of_classes,
                                  'tree_method':'gpu_hist',
                                  'predictor':'gpu_predictor'}
                
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
                
        objective = Objective(d_train, d_valid, dict_single_parameters,
                              maximum_boosting_rounds, early_stop_rounds,
                              results_directory)
    
        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(direction=optimizer_direction,
                sampler=TPESampler(n_startup_trials=number_of_random_points))
        
        study.optimize(objective, timeout=maximum_time)
        
        # save results
        df_results = study.trials_dataframe()
        df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
        df_results.to_csv(results_directory + 'df_optuna_results.csv')
        
        
        elapsed_time_total = (time.time()-start_time_total)/60
        print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
        
    elif calculation_type == 'production':
        
        # get optuna results parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_optuna_results.pkl')
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                
        make_final_predictions(x_prod, y_prod, 
                               class_names_list, 
                               models_dir, results_directory, 
                               df_parameters, 
                               threshold_error, file_name_stub,
                               number_of_models, optimizer_direction)
             
    else:
        print('\ninvalid calculation type:', calculation_type)
        raise NameError('invalid calculation type: ' + calculation_type)