Hyperparameter Search With Optuna: Part 1 – Scikit-learn Classification and Ensembling
1. Introduction
In this article, we use the tree-structured Parzen algorithm via Optuna to find hyperparameters for XGBoost for the the MNIST handwritten digits data set classification problem.
2. Using Optuna With XGBoost
To integrate XGBoost with Optuna, we use the following class.
class Objective(object): def __init__(self, dtrain, dvalid, dict_single_params, max_boosting_rounds, early_stop, dir_save): self.dtrain = dtrain self.dvalid = dvalid self.dict_single_params = dict_single_params self.maximum_boosting_rounds = max_boosting_rounds self.early_stop_rounds = early_stop self.dir_save = dir_save def __call__(self, trial): dictionary_single_params = self.dict_single_params list_bins = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250] eta = trial.suggest_uniform('eta', 0.1, 0.8) max_depth = trial.suggest_int('max_depth', 3, 20) colsample_bytree=trial.suggest_discrete_uniform('colsample_bytree',0.5,1.0,0.05) reg_lambda = trial.suggest_int('reg_lambda', 1, 6) reg_alpha = trial.suggest_int('reg_alpha', 0, 3) subsample = trial.suggest_discrete_uniform('subsample', 0.6, 0.9, 0.1) max_bin = trial.suggest_categorical('max_bin', list_bins) dict_variable_params = {'eta':eta, 'max_depth':max_depth, 'colsample_bytree':colsample_bytree, 'reg_lambda':reg_lambda, 'reg_alpha':reg_alpha, 'subsample':subsample, 'max_bin':max_bin} dictionary_single_params.update(dict_variable_params) watchlist = [(self.dvalid, 'eval')] xgb_model = xgb.train(params=dictionary_single_params, dtrain=self.dtrain, evals=watchlist, num_boost_round=self.maximum_boosting_rounds, early_stopping_rounds=self.early_stop_rounds, verbose_eval=False) multiclass_log_loss = xgb_model.best_score # save model if xgb_model.best_ntree_limit <= self.maximum_boosting_rounds: joblib.dump(xgb_model, self.dir_save + str(trial.number) + '_xgb.joblib') return multiclass_log_loss
We are compelled to use a class here instead of simply using a function due to how Optuna has been coded (see comments here).
See remarks in our article Hyperparameter Search With Optuna: Part 1 – Scikit-learn Classification and Ensembling for how different types of parameter ranges are expressed in Optuna.
We decided to save all XGBoost models that converged. We could have also applied a minimum multiclass log loss threshold at this point, but decided to do this later in the production phase as we did not have a good feel for setting the threshold value.
dict_single_params comes from the main code:
error_type = 'mlogloss' dict_single_parameters = {'booster':'gbtree', 'verbosity':0, 'objective':'multi:softprob', 'eval_metric':error_type, 'num_class':number_of_classes, 'tree_method':'gpu_hist', 'predictor':'gpu_predictor'}
Setting the objective to multi:softprob allows us to recover class probabilities from predictions.
In the main code, we have:
optimizer_direction = 'minimize' number_of_random_points = 25 # random searches to start opt process maximum_time = 3*60*60 # seconds objective = Objective(d_train, d_valid, dict_single_parameters, maximum_boosting_rounds, early_stop_rounds, results_directory) optuna.logging.set_verbosity(optuna.logging.WARNING) study = optuna.create_study(direction=optimizer_direction, sampler=TPESampler(n_startup_trials=number_of_random_points)) study.optimize(objective, timeout=maximum_time) # save results df_results = study.trials_dataframe() df_results.to_pickle(results_directory + 'df_optuna_results.pkl') df_results.to_csv(results_directory + 'df_optuna_results.csv')
One of the most useful aspects of Optuna is the ability to save information about each trial in a DataFrame. See our previous article for information about other Optuna parameters in this code fragment.
3. Results
Below is the DataFrame from the Optuna study. We sorted by the ‘value’ columns (this is the multiclass log loss) and only kept the 25 best results.
Optuna Results DataFrame
number | value | params_colsample_bytree | params_eta | params_max_bin | params_max_depth | params_reg_alpha | params_reg_lambda | params_subsample | system_attrs__number | state |
---|---|---|---|---|---|---|---|---|---|---|
122 | 0.1579 | 0.85 | 0.1473 | 125 | 6 | 0 | 6 | 0.9 | 122 | COMPLETE |
121 | 0.1585 | 0.85 | 0.1178 | 125 | 6 | 0 | 6 | 0.9 | 121 | COMPLETE |
106 | 0.1586 | 0.8 | 0.1556 | 250 | 6 | 0 | 6 | 0.9 | 106 | COMPLETE |
115 | 0.1589 | 0.85 | 0.1756 | 250 | 6 | 0 | 6 | 0.9 | 115 | COMPLETE |
99 | 0.1590 | 0.85 | 0.1613 | 250 | 5 | 0 | 5 | 0.9 | 99 | COMPLETE |
124 | 0.1593 | 0.85 | 0.1237 | 125 | 6 | 0 | 6 | 0.9 | 124 | COMPLETE |
68 | 0.1593 | 0.85 | 0.1705 | 250 | 7 | 0 | 5 | 0.9 | 68 | COMPLETE |
116 | 0.1594 | 0.85 | 0.1222 | 125 | 6 | 0 | 6 | 0.9 | 116 | COMPLETE |
108 | 0.1594 | 0.85 | 0.1324 | 250 | 5 | 0 | 6 | 0.9 | 108 | COMPLETE |
81 | 0.1596 | 0.8 | 0.2273 | 250 | 7 | 0 | 6 | 0.9 | 81 | COMPLETE |
123 | 0.1596 | 0.85 | 0.1186 | 125 | 6 | 0 | 6 | 0.9 | 123 | COMPLETE |
76 | 0.1602 | 0.5 | 0.2719 | 250 | 7 | 0 | 5 | 0.9 | 76 | COMPLETE |
119 | 0.1603 | 0.85 | 0.1387 | 125 | 6 | 0 | 6 | 0.9 | 119 | COMPLETE |
117 | 0.1604 | 0.85 | 0.1313 | 125 | 4 | 0 | 6 | 0.9 | 117 | COMPLETE |
72 | 0.1605 | 0.85 | 0.1667 | 250 | 7 | 0 | 5 | 0.9 | 72 | COMPLETE |
96 | 0.1606 | 0.85 | 0.2263 | 250 | 8 | 0 | 6 | 0.9 | 96 | COMPLETE |
101 | 0.1606 | 0.85 | 0.1878 | 250 | 5 | 0 | 6 | 0.9 | 101 | COMPLETE |
98 | 0.1606 | 0.85 | 0.1618 | 250 | 5 | 0 | 6 | 0.9 | 98 | COMPLETE |
74 | 0.1608 | 0.8 | 0.2244 | 250 | 7 | 0 | 5 | 0.9 | 74 | COMPLETE |
71 | 0.1611 | 0.85 | 0.1671 | 250 | 7 | 0 | 5 | 0.9 | 71 | COMPLETE |
88 | 0.1611 | 0.85 | 0.2629 | 250 | 8 | 0 | 6 | 0.9 | 88 | COMPLETE |
100 | 0.1613 | 0.85 | 0.1904 | 250 | 5 | 0 | 6 | 0.9 | 100 | COMPLETE |
111 | 0.1613 | 0.85 | 0.1839 | 250 | 6 | 0 | 6 | 0.9 | 111 | COMPLETE |
70 | 0.1616 | 0.85 | 0.1693 | 250 | 7 | 0 | 5 | 0.9 | 70 | COMPLETE |
80 | 0.1618 | 0.8 | 0.2123 | 250 | 7 | 0 | 6 | 0.6 | 80 | COMPLETE |
To create the final result, we set a minimum loss threshold of 0.162 and only used the 25 best models. Then we averaged the resulting class probabilities and used plurality voting to obtain final class predictions.
def make_final_predictions(xprod, yprod, list_class_names, models_directory, save_directory, df_params, threshold, ml_name, num_models_accept, optimization_direction): dprod = xgb.DMatrix(xprod, label=yprod) if optimization_direction == 'maximize': df_params.sort_values(by='value', ascending=False, inplace=True) else: df_params.sort_values(by='value', ascending=True, inplace=True) # apply threshold accepted_models_num = 0 list_predicted_prob = [] num_models_total = df_params.shape[0] for i in range(num_models_total): if optimization_direction == 'maximize': bool1 = df_params.loc[df_params.index[i],'value'] > threshold else: bool1 = df_params.loc[df_params.index[i],'value'] < threshold bool2 = df_params.loc[df_params.index[i],'state'] == 'COMPLETE' bool3 = accepted_models_num < num_models_accept if bool1 and bool2 and bool3: model_number = str(df_params.loc[df_params.index[i],'number']) try: xgb_model = joblib.load(models_directory + model_number + '_xgb.joblib') except: print('\ncould not read model number:',model_number) else: list_predicted_prob.append(xgb_model.predict(dprod)) accepted_models_num = accepted_models_num + 1 # compute mean probabilities mean_probabilities = np.mean(list_predicted_prob, axis=0) # compute predicted class # argmax uses 1st ocurrance in case of a tie y_predicted_class = np.argmax(mean_probabilities, axis=1)
balanced accuracy score = 0.9597
accuracy score = 0.9599
4. Code
Below is the code that has not already been shown in the sections above.
import os import numpy as np import pandas as pd import xgboost as xgb import optuna from optuna.samplers import TPESampler from sklearn.model_selection import train_test_split import joblib from sklearn.metrics import balanced_accuracy_score, accuracy_score from sklearn.metrics import confusion_matrix, classification_report import time from pathlib import Path import sys # ******************************************************************* if __name__ == '__main__': ml_algorithm_name = 'xgboost' file_name_stub = 'optuna_' + ml_algorithm_name calculation_type = 'production' #'calibration' 'production' base_directory = YOUR DIRECTORY data_directory = base_directory + 'data/' results_directory_stub = base_directory + file_name_stub + '/' if not Path(results_directory_stub).is_dir(): os.mkdir(results_directory_stub) # fixed parameters - calibration maximum_boosting_rounds = 1500 early_stop_rounds = 10 error_type = 'mlogloss' optimizer_direction = 'minimize' number_of_random_points = 25 # random searches to start opt process maximum_time = 3*60*60 # seconds # fixed parameters - production threshold_error = 0.162 # multiclass log loss number_of_models = 25 print('\n*** starting at',pd.Timestamp.now()) start_time_total = time.time() # calibration if calculation_type == 'calibration': # use small data set for illustration purposes x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy') y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy') number_of_classes = np.unique(y_calib).shape[0] x_train, x_valid, y_train, y_valid = train_test_split(x_calib ,y_calib, train_size=0.6, shuffle=True, stratify=y_calib) # transform numpy arrays into dmatrix format d_train = xgb.DMatrix(x_train, label=y_train) d_valid = xgb.DMatrix(x_valid, label=y_valid) # xgb single params dict_single_parameters = {'booster':'gbtree', 'verbosity':0, 'objective':'multi:softprob', 'eval_metric':error_type, 'num_class':number_of_classes, 'tree_method':'gpu_hist', 'predictor':'gpu_predictor'} results_directory = results_directory_stub + calculation_type + '/' if not Path(results_directory).is_dir(): os.mkdir(results_directory) objective = Objective(d_train, d_valid, dict_single_parameters, maximum_boosting_rounds, early_stop_rounds, results_directory) optuna.logging.set_verbosity(optuna.logging.WARNING) study = optuna.create_study(direction=optimizer_direction, sampler=TPESampler(n_startup_trials=number_of_random_points)) study.optimize(objective, timeout=maximum_time) # save results df_results = study.trials_dataframe() df_results.to_pickle(results_directory + 'df_optuna_results.pkl') df_results.to_csv(results_directory + 'df_optuna_results.csv') elapsed_time_total = (time.time()-start_time_total)/60 print('\n\ntotal elapsed time =',elapsed_time_total,' minutes') elif calculation_type == 'production': # get optuna results parameters models_dir = results_directory_stub + 'calibration/' df_parameters = pd.read_pickle(models_dir + 'df_optuna_results.pkl') results_directory = results_directory_stub + calculation_type + '/' if not Path(results_directory).is_dir(): os.mkdir(results_directory) x_prod = np.load(data_directory + 'x_mnist_production.npy') y_prod = np.load(data_directory + 'y_mnist_production.npy') num_classes = np.unique(y_prod).shape[0] class_names_list = [] for i in range(num_classes): class_names_list.append('class ' + str(i)) make_final_predictions(x_prod, y_prod, class_names_list, models_dir, results_directory, df_parameters, threshold_error, file_name_stub, number_of_models, optimizer_direction) else: print('\ninvalid calculation type:',calculation_type) raise NameError