Paper: Optuna: A Next-generation Hyperparameter Optimization Framework – Akiba et al., 2019
Hyperparameter Search With Optuna: Part 1 – Scikit-learn Classification and Ensembling
Hyperparameter Search With Optuna: Part 2 – XGBoost Classification and Ensembling
Hyperparameter Search With Optuna: Part 3 – Keras (CNN) Classification and Ensembling

  1. Introduction
  2. Asynchronous Successive Halving
  3. Results
  4. Code

1. Introduction

In addition to using the tree-structured Parzen estimator (TPE) via Optuna to find hyperparameters for XGBoost on the MNIST handwritten digits classification problem, we add asynchronous successive halving, a pruning algorithm that halts training when preliminary results are unpromising.

2. Asynchronous Successive Halving

Successive Halving is a bandit-based algorithm for identifying the best among multiple configurations. Optuna's SuccessiveHalvingPruner implements an asynchronous version of Successive Halving; see the Asynchronous Successive Halving paper for a detailed description.

As applied to XGBoost, this means that after a certain number of boosting rounds, if the validation error is not competitive with that of other trials at the same stage (see the references above), training is stopped (pruned) for that set of hyperparameters. This strategy allows Optuna to sample a greater number of hyperparameter sets in a given amount of computation time.
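To see what the pruner does under the hood, here is a minimal, self-contained sketch (not from the original code) of Optuna's generic pruning interface, which the XGBoost callback shown later wraps: the objective reports an intermediate value at each step, and the pruner decides whether the trial should continue. The toy objective and its 'decay' parameter are purely illustrative.

import optuna

def toy_objective(trial):
    # purely illustrative objective: 'decay' stands in for a real hyperparameter
    decay = trial.suggest_float('decay', 0.90, 0.99)
    error = 1.0
    for step in range(100):
        error *= decay  # stand-in for one boosting round of training
        # report the intermediate error so the pruner can compare trials
        trial.report(error, step)
        # the pruner decides, at each rung, whether this trial should continue
        if trial.should_prune():
            raise optuna.TrialPruned()
    return error

study = optuna.create_study(direction='minimize',
            pruner=optuna.pruners.SuccessiveHalvingPruner())
study.optimize(toy_objective, n_trials=20)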

To implement pruning, we make the following changes to the code used in Hyperparameter Search With Optuna: Part 2 – XGBoost Classification and Ensembling.

First, in class Objective(object), we add:

# key watched by the pruning callback: '<watchlist name>-<eval_metric>'
prune_error = 'eval-' + dictionary_single_params['eval_metric']
# prune_error = 'eval-mlogloss'

# reports intermediate results to Optuna and prunes unpromising trials
pruning_callback = optuna.integration.XGBoostPruningCallback(trial, prune_error)

xgb_model = xgb.train(params=dictionary_single_params,
                      dtrain=self.dtrain, evals=watchlist,
                      num_boost_round=self.maximum_boosting_rounds,
                      early_stopping_rounds=self.early_stop_rounds,
                      verbose_eval=False,
                      callbacks=[pruning_callback])
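
For context, below is a minimal sketch of how these lines sit inside the objective. It is a simplified stand-in for the Part 2 code, not a copy of it: the constructor arguments, the suggested hyperparameters, and the validation set name dvalid are assumptions for illustration. The key detail is that the watchlist name 'eval' must match the 'eval-' prefix in prune_error.

import optuna
import xgboost as xgb

class Objective(object):
    def __init__(self, dtrain, dvalid, maximum_boosting_rounds, early_stop_rounds):
        self.dtrain = dtrain
        self.dvalid = dvalid
        self.maximum_boosting_rounds = maximum_boosting_rounds
        self.early_stop_rounds = early_stop_rounds

    def __call__(self, trial):
        # illustrative subset of hyperparameters sampled by Optuna
        dictionary_single_params = {
            'objective': 'multi:softprob',
            'num_class': 10,
            'eval_metric': 'mlogloss',
            'eta': trial.suggest_float('eta', 0.1, 0.8),
            'max_depth': trial.suggest_int('max_depth', 10, 20)}

        # the watchlist name 'eval' produces metric keys like 'eval-mlogloss'
        watchlist = [(self.dvalid, 'eval')]

        prune_error = 'eval-' + dictionary_single_params['eval_metric']
        pruning_callback = optuna.integration.XGBoostPruningCallback(trial, prune_error)

        xgb_model = xgb.train(params=dictionary_single_params,
                              dtrain=self.dtrain, evals=watchlist,
                              num_boost_round=self.maximum_boosting_rounds,
                              early_stopping_rounds=self.early_stop_rounds,
                              verbose_eval=False,
                              callbacks=[pruning_callback])

        # best validation mlogloss, which Optuna minimizes
        return xgb_model.best_score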

Next, in the main code we have:

study = optuna.create_study(direction=optimizer_direction,
                sampler=TPESampler(n_startup_trials=number_of_random_points),
                pruner=optuna.pruners.SuccessiveHalvingPruner(min_resource='auto', 
                       reduction_factor=4, min_early_stopping_rate=0))

Setting min_resource to 'auto' means that Optuna uses a heuristic to determine how many boosting rounds to perform before pruning can occur; the heuristic is based on the first trial that runs to completion. Alternatively, this parameter can be set explicitly by the user. See the Optuna documentation for definitions of the other parameters.
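As an illustration of setting it explicitly (the values below are made up, not the settings used in this post), the rung milestones grow geometrically from min_resource by factors of reduction_factor:

import optuna

# illustrative values, not the settings used in this post
pruner = optuna.pruners.SuccessiveHalvingPruner(min_resource=5,
             reduction_factor=4, min_early_stopping_rate=0)

# with these values a trial can first be pruned after 5 boosting rounds (rung 0);
# later rungs occur at 5 * 4 = 20, 5 * 16 = 80, 5 * 64 = 320 rounds, and at each
# rung roughly only the best 1/4 of trials seen so far are allowed to continue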

Finally, the make_final_predictions() function is identical to the non-pruned version, except that we now make use of the 'state' column in the DataFrame that Optuna uses to store trial results. When pruning is enabled, 'state' can have a value of COMPLETE or PRUNED, and we only use the saved XGBoost models from trials that ran to completion.
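
A minimal sketch of that filtering step, assuming the trial results are pulled with study.trials_dataframe() (column names and state strings can vary slightly across Optuna versions):

# all trial results as a pandas DataFrame
df = study.trials_dataframe()

# keep only trials that ran to completion, then sort by the validation loss
df_complete = df[df['state'] == 'COMPLETE'].sort_values(by='value')

# trial numbers of the 25 best completed trials, used to select saved models
best_trial_numbers = df_complete['number'].values[:25]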

3. Results

Below is the DataFrame from the Optuna study. We sorted by the ‘value’ column (this is the multiclass log loss) and only kept the 25 best results.

Optuna Results DataFrame

number | value | params_colsample_bytree | params_eta | params_max_bin | params_max_depth | params_reg_alpha | params_reg_lambda | params_subsample | system_attrs__number | system_attrs_completed_rung_0 | system_attrs_completed_rung_1 | system_attrs_completed_rung_2 | system_attrs_completed_rung_3 | state
615 | 0.1611 | 0.85 | 0.6607 | 25 | 14 | 0 | 3 | 0.9 | 615 | 0.4575 | 0.2357 | 0.1705 | | COMPLETE
609 | 0.1616 | 0.85 | 0.6547 | 25 | 16 | 0 | 3 | 0.9 | 609 | 0.4572 | 0.2320 | 0.1690 | | COMPLETE
278 | 0.1623 | 0.85 | 0.6469 | 25 | 12 | 0 | 3 | 0.9 | 278 | 0.4634 | 0.2386 | 0.1701 | | COMPLETE
830 | 0.1626 | 0.85 | 0.6838 | 25 | 14 | 0 | 3 | 0.9 | 830 | 0.4495 | 0.2323 | 0.1710 | | COMPLETE
1360 | 0.1627 | 0.85 | 0.6812 | 25 | 14 | 0 | 3 | 0.9 | 1360 | 0.4506 | 0.2353 | 0.1713 | | COMPLETE
819 | 0.1631 | 0.85 | 0.6687 | 25 | 14 | 0 | 3 | 0.9 | 819 | 0.4539 | 0.2346 | 0.1705 | | COMPLETE
789 | 0.1634 | 0.85 | 0.6811 | 25 | 14 | 0 | 3 | 0.9 | 789 | 0.4505 | 0.2354 | 0.1726 | 0.1640 | COMPLETE
998 | 0.1634 | 0.85 | 0.6767 | 25 | 14 | 0 | 3 | 0.9 | 998 | 0.4512 | 0.2359 | 0.1730 | | COMPLETE
309 | 0.1635 | 0.85 | 0.6787 | 25 | 11 | 0 | 3 | 0.9 | 309 | 0.4496 | 0.2332 | 0.1711 | | COMPLETE
365 | 0.1639 | 0.85 | 0.6768 | 25 | 11 | 0 | 3 | 0.9 | 365 | 0.4504 | 0.2350 | 0.1716 | | COMPLETE
835 | 0.1639 | 0.85 | 0.6823 | 25 | 14 | 0 | 3 | 0.9 | 835 | 0.4501 | 0.2348 | 0.1732 | | COMPLETE
745 | 0.1640 | 0.85 | 0.6641 | 25 | 15 | 0 | 3 | 0.9 | 745 | 0.4545 | 0.2366 | 0.1738 | | COMPLETE
596 | 0.1640 | 0.85 | 0.6730 | 25 | 14 | 0 | 3 | 0.9 | 596 | 0.4532 | 0.2357 | 0.1722 | | COMPLETE
1261 | 0.1641 | 0.9 | 0.6837 | 25 | 14 | 0 | 3 | 0.9 | 1261 | 0.4509 | 0.2344 | 0.1730 | | COMPLETE
277 | 0.1641 | 0.85 | 0.6513 | 25 | 12 | 0 | 3 | 0.9 | 277 | 0.4614 | 0.2380 | 0.1722 | | COMPLETE
1098 | 0.1642 | 0.85 | 0.6788 | 25 | 14 | 0 | 3 | 0.9 | 1098 | 0.4506 | 0.2360 | 0.1712 | | COMPLETE
623 | 0.1642 | 0.85 | 0.6761 | 25 | 14 | 0 | 3 | 0.9 | 623 | 0.4527 | 0.2367 | 0.1724 | | COMPLETE
1149 | 0.1642 | 0.85 | 0.6935 | 25 | 11 | 0 | 3 | 0.9 | 1149 | 0.4434 | 0.2323 | 0.1734 | | COMPLETE
656 | 0.1642 | 0.85 | 0.6672 | 25 | 20 | 0 | 3 | 0.9 | 656 | 0.4534 | 0.2337 | 0.1709 | | COMPLETE
165 | 0.1643 | 0.85 | 0.6910 | 25 | 10 | 0 | 3 | 0.9 | 165 | 0.4444 | 0.2335 | 0.1729 | | COMPLETE
276 | 0.1643 | 0.85 | 0.6590 | 25 | 12 | 0 | 3 | 0.9 | 276 | 0.4588 | 0.2379 | 0.1748 | | COMPLETE
625 | 0.1643 | 0.85 | 0.6826 | 25 | 14 | 0 | 3 | 0.9 | 625 | 0.4499 | 0.2346 | 0.1717 | | COMPLETE
593 | 0.1646 | 0.85 | 0.6769 | 25 | 13 | 0 | 3 | 0.9 | 593 | 0.4505 | 0.2374 | 0.1743 | | COMPLETE
361 | 0.1646 | 0.55 | 0.7114 | 25 | 11 | 0 | 3 | 0.9 | 361 | 0.4459 | 0.2360 | 0.1746 | | COMPLETE
460 | 0.1647 | 0.85 | 0.6690 | 25 | 13 | 0 | 3 | 0.9 | 460 | 0.4536 | 0.2359 | 0.1726 | | COMPLETE

To create the final result, we set a minimum loss threshold of 0.166 and only used the 25 best models that ran to completion. Then we averaged the resulting class probabilities and used plurality voting to obtain final class predictions.
 
balanced accuracy score = 0.9551
accuracy score = 0.9553
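
For reference, here is a rough sketch of the ensembling step described above, reading "averaging plus plurality voting" as soft voting: average the per-model class probabilities, then take the most probable class for each sample. The inputs predicted_probabilities (a list of (n_samples, n_classes) arrays, one per kept model) and y_test are hypothetical placeholders, not variables from the original code.

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def ensemble_predictions(predicted_probabilities, y_test):
    # predicted_probabilities: list of (n_samples, n_classes) arrays, one per kept model
    # soft voting: average the class probabilities across models
    mean_probabilities = np.mean(np.stack(predicted_probabilities), axis=0)
    # pick the class with the highest averaged probability for each sample
    y_predicted = np.argmax(mean_probabilities, axis=1)
    return (balanced_accuracy_score(y_test, y_predicted),
            accuracy_score(y_test, y_predicted))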


4. Code

See Hyperparameter Search With Optuna: Part 2 – XGBoost Classification and Ensembling and the changes mentioned above.