1. Introduction
2. Optuna References
3. Using Optuna With scikit-learn
4. Results
5. Code

1. Introduction

Optuna is a Python package for general function optimization. It also includes integrations with many popular machine learning packages that allow the use of pruning algorithms, making hyperparameter searches more efficient. In this article we use Optuna to optimize hyperparameters for scikit-learn machine learning algorithms.

2. Optuna References

Preferred Networks created Optuna for internal use and later released it as open-source software. We hope this implies long-term support for the package.

3. Using Optuna With scikit-learn

We demonstrate how to use Optuna with scikit-learn by example. First, we create the following class:

class Objective(object):
    def __init__(self, xcalib, ycalib, error_type, cv_folds):
        self.xcalib = xcalib
        self.ycalib = ycalib
        self.error_type = error_type
        self.cv_folds = cv_folds

    def __call__(self, trial):
        # candidate values for the number of trees (n_estimators)
        list_trees = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250]

        # pick which algorithm this trial will evaluate
        classifier_name = trial.suggest_categorical('classifier', ['etrees', 'randomforest'])
        if classifier_name == 'etrees':
            
            et_n_estimators = trial.suggest_categorical('et_n_estimators', list_trees)
            et_max_features = trial.suggest_uniform('et_max_features', 0.15, 1.0)
            et_min_samples_split = trial.suggest_int('et_min_samples_split', 2, 14)
            et_min_samples_leaf = trial.suggest_int('et_min_samples_leaf', 1, 14)
            et_max_samples = trial.suggest_uniform('et_max_samples', 0.6, 0.99)
                        
            classifier_obj = ExtraTreesClassifier(n_estimators=et_n_estimators,
                             max_features=et_max_features, min_samples_split=et_min_samples_split,
                             min_samples_leaf=et_min_samples_leaf, max_samples=et_max_samples,
                             bootstrap=True, n_jobs=-1, verbose=0)
        else:    
            rf_n_estimators = trial.suggest_categorical('rf_n_estimators', list_trees)
            rf_max_features = trial.suggest_uniform('rf_max_features', 0.15, 1.0)
            rf_min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 14)
            rf_min_samples_leaf = trial.suggest_int('rf_min_samples_leaf', 1, 14)
            rf_max_samples = trial.suggest_uniform('rf_max_samples', 0.6, 0.99)
            
            classifier_obj = RandomForestClassifier(n_estimators=rf_n_estimators,
                             max_features=rf_max_features, min_samples_split=rf_min_samples_split,
                             min_samples_leaf=rf_min_samples_leaf, max_samples=rf_max_samples,
                             bootstrap=True, n_jobs=-1, verbose=0)
                    
        mean_cv_score = cross_val_score(classifier_obj, self.xcalib, self.ycalib, 
                                        scoring=self.error_type, 
                                        cv=self.cv_folds, n_jobs=-1).mean()

        return mean_cv_score

Optuna calls a specific set of hyperparameters together with the subsequent function evaluation a trial; a set of trials is called a study (see below). Here we specify ranges of hyperparameters for the extra trees (extremely randomized trees) and random forest classification algorithms. The trial.suggest_* functions are largely self-explanatory; all of the available options are listed in the Optuna documentation. One quirk is that we use trial.suggest_categorical with a list of integers to specify the possible numbers of trees, because Optuna does not offer a low/high/step suggestion for integers the way it does for floats (trial.suggest_discrete_uniform).
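
To make this quirk concrete, here is a minimal sketch of the two kinds of calls. It assumes the same Optuna version used in this article (where suggest_discrete_uniform is available); optuna.trial.FixedTrial is used only so the snippet runs outside a study, and the parameter names and values are made up for illustration.

import optuna

def demo(trial):
    # integers supplied as a categorical list of choices (the quirk described above)
    n_estimators = trial.suggest_categorical('n_estimators', [25, 50, 75, 100])
    # floats can use a low/high/step grid directly
    max_features = trial.suggest_discrete_uniform('max_features', 0.1, 1.0, 0.05)
    return n_estimators, max_features

# FixedTrial exercises the suggest calls with fixed values, without running a study
print(demo(optuna.trial.FixedTrial({'n_estimators': 50, 'max_features': 0.5})))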

Note our use of ‘et_’ and ‘rf_’ as prefixes on the names of the parameters passed to trial.suggest_categorical, trial.suggest_uniform, etc. Optuna lets us save all of the parameter values used in a study to a Pandas DataFrame, and these prefixes make it easy to see which parameters belong to which algorithm.

In the main function, we have:

error_type = 'balanced_accuracy'
optimizer_direction = 'maximize'
cross_valid_folds = 3
number_of_random_points = 25
maximum_time = 60*60  # seconds

objective = Objective(x_calib, y_calib, error_type, cross_valid_folds)

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction=optimizer_direction,
        sampler=TPESampler(n_startup_trials=number_of_random_points))

study.optimize(objective, timeout=maximum_time)

# save results
df_results = study.trials_dataframe()
df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
df_results.to_csv(results_directory + 'df_optuna_results.csv')

Optuna can be set to minimize or maximize the evaluation function.

sampler specifies the search algorithm to be used; we chose TPE (Tree-structured Parzen Estimator). Setting number_of_random_points = 25 means that the first 25 trials use random search to seed TPE.
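
As an aside, the TPE sampler can also be seeded to make the search reproducible; a minimal sketch follows (the seed argument is an addition for illustration and is not used in this article's runs).

import optuna
from optuna.samplers import TPESampler

# first 25 trials are random search; seed fixes the sampler's random state
sampler = TPESampler(n_startup_trials=25, seed=0)
study = optuna.create_study(direction='maximize', sampler=sampler)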

In study.optimize() we specify the run time limit in seconds via timeout. Alternatively, we can set n_trials to specify the total number of trials (sets of hyperparameters) to evaluate. For all options of this function, see the Optuna documentation for study.optimize().
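
For example, assuming the study and objective defined above, either stopping criterion (or both) can be passed; the trial count of 200 below is arbitrary.

# stop after one hour of wall-clock time (what we do above)
study.optimize(objective, timeout=60*60)

# or stop after 200 trials; if both limits are given, whichever is reached first stops the study
# study.optimize(objective, n_trials=200, timeout=60*60)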

study.trials_dataframe() returns a Pandas DataFrame containing all hyperparameter values and function evaluations. The DataFrame is shown below. We sorted it by balanced accuracy (the column labeled value), kept only the top 25 rows, and deleted the columns with start and end times. The column named ‘state’ refers to pruning (stopping an evaluation early), which we did not use here but will demonstrate in a later article; COMPLETE means that the trial was not pruned.
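
The sorting and trimming described above are ordinary Pandas operations; a minimal sketch follows. The datetime_start, datetime_complete, and duration column names are those produced by trials_dataframe in the Optuna version used here, and errors='ignore' guards against naming differences across versions.

df_results = study.trials_dataframe()
df_results = df_results.sort_values(by='value', ascending=False).head(25)
df_results = df_results.drop(columns=['datetime_start', 'datetime_complete', 'duration'], errors='ignore')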

Optuna Results DataFrame

(All of the top 25 trials used the extra trees classifier, so the empty params_rf_* columns are omitted, as are the duplicated index and the redundant system_attrs__number column; the params_ prefix is dropped from the remaining column names for readability.)

number  value   classifier  et_max_features  et_max_samples  et_min_samples_leaf  et_min_samples_split  et_n_estimators  state
278     0.9529  etrees      0.1504           0.9894          1                    2                     250              COMPLETE
341     0.9528  etrees      0.1942           0.9897          1                    2                     250              COMPLETE
207     0.9526  etrees      0.1663           0.9616          1                    3                     250              COMPLETE
378     0.9526  etrees      0.1714           0.9807          1                    2                     250              COMPLETE
388     0.9526  etrees      0.1525           0.9794          1                    2                     250              COMPLETE
394     0.9526  etrees      0.1522           0.956            1                    2                     250              COMPLETE
183     0.9525  etrees      0.1692           0.9684          1                    2                     175              COMPLETE
305     0.9524  etrees      0.1531           0.9537          1                    2                     225              COMPLETE
318     0.9524  etrees      0.2026           0.9716          1                    2                     250              COMPLETE
192     0.9522  etrees      0.1671           0.9739          1                    2                     175              COMPLETE
286     0.9522  etrees      0.1506           0.9768          1                    2                     250              COMPLETE
326     0.9522  etrees      0.2176           0.969            1                    2                     250              COMPLETE
345     0.9522  etrees      0.267             0.9887          1                    2                     250              COMPLETE
380     0.9522  etrees      0.1508           0.9817          1                    2                     250              COMPLETE
427     0.9522  etrees      0.1502           0.972            1                    2                     250              COMPLETE
428     0.9522  etrees      0.1513           0.9757          1                    2                     250              COMPLETE
444     0.9522  etrees      0.1502           0.9487          1                    2                     250              COMPLETE
213     0.9521  etrees      0.2021           0.9744          1                    2                     250              COMPLETE
432     0.9521  etrees      0.1501           0.9811          1                    2                     225              COMPLETE
225     0.952   etrees      0.225             0.9823          1                    2                     250              COMPLETE
302     0.952   etrees      0.1628           0.9527          1                    2                     250              COMPLETE
309     0.952   etrees      0.1642           0.9581          1                    2                     225              COMPLETE
172     0.9519  etrees      0.1694           0.9686          1                    2                     175              COMPLETE
194     0.9519  etrees      0.1649           0.958            1                    3                     250              COMPLETE
222     0.9519  etrees      0.22               0.9771          1                    2                     250              COMPLETE

Below is the function that reads in the saved hyperparameters, fits models on all of the calibration data (we used half of the MNIST handwritten digits data set to reduce computation time; cross validation was only used in the Objective class), and generates predictions on production (unseen) data. Note that we sort the DataFrame, apply a minimum balanced accuracy threshold (0.93), and cap the number of accepted models (25). The models are ensembled by soft voting: we compute each model's class probabilities, average them, and take the class with the highest mean probability as the final prediction (a small numeric illustration of this step appears after the function). If you only want results for the single best set of hyperparameters, sort the DataFrame and take the values from its first row.

def make_final_predictions(xcalib, ycalib, xprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, ml_name, num_models_accept, optimization_direction):
    
    if optimization_direction == 'maximize':
        df_params.sort_values(by='value', ascending=False, inplace=True)
    else:
        df_params.sort_values(by='value', ascending=True, inplace=True)
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if optimization_direction == 'maximize':
            bool1 = df_params.loc[df_params.index[i],'value'] > threshold
        else:
            bool1 = df_params.loc[df_params.index[i],'value'] < threshold
                    
        bool2 = df_params.loc[df_params.index[i],'state'] == 'COMPLETE'
        bool3 = accepted_models_num < num_models_accept
        if bool1 and bool2 and bool3:
            model_name = df_params.loc[df_params.index[i],'params_classifier']
            
            if model_name == 'etrees':
                ml_model = ExtraTreesClassifier(
                n_estimators=int(df_params.loc[df_params.index[i],'params_et_n_estimators']),
                max_features=df_params.loc[df_params.index[i],'params_et_max_features'], 
                min_samples_split=int(df_params.loc[df_params.index[i],'params_et_min_samples_split']),
                min_samples_leaf=int(df_params.loc[df_params.index[i],'params_et_min_samples_leaf']),
                max_samples=df_params.loc[df_params.index[i],'params_et_max_samples'],
                bootstrap=True, n_jobs=-1, verbose=0)
                
            elif model_name == 'randomforest':
                ml_model = RandomForestClassifier(
                n_estimators=int(df_params.loc[df_params.index[i],'params_rf_n_estimators']),
                max_features=df_params.loc[df_params.index[i],'params_rf_max_features'], 
                min_samples_split=int(df_params.loc[df_params.index[i],'params_rf_min_samples_split']),
                min_samples_leaf=int(df_params.loc[df_params.index[i],'params_rf_min_samples_leaf']),
                max_samples=df_params.loc[df_params.index[i],'params_rf_max_samples'],
                bootstrap=True, n_jobs=-1, verbose=0)
                
            else:
                print('\ncannot get correct model_name:',model_name)
                raise NameError
            
            ml_model.fit(xcalib, ycalib)
            
            list_predicted_prob.append(ml_model.predict_proba(xprod))
            
            accepted_models_num = accepted_models_num + 1
            
            if save_models_flag:
                number_string = str(df_params.loc[df_params.index[i],'number'])
                model_name = model_name + '_' + number_string + '_joblib.sav'
                dump(ml_model, save_directory + model_name)

    # compute mean probabilities across all accepted models
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses the 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)
    
    # print scores on the production data (these are the values reported in the Results section)
    print('\nbalanced accuracy score =', balanced_accuracy_score(yprod, y_predicted_class))
    print('accuracy score =', accuracy_score(yprod, y_predicted_class))
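
As a small numeric illustration of the averaging step above, here are made-up probabilities from two hypothetical models over two samples and two classes.

import numpy as np

probs_model_a = np.array([[0.6, 0.4], [0.2, 0.8]])
probs_model_b = np.array([[0.4, 0.6], [0.3, 0.7]])

# average the per-model class probabilities, then take the most probable class
mean_probabilities = np.mean([probs_model_a, probs_model_b], axis=0)  # [[0.5, 0.5], [0.25, 0.75]]
print(np.argmax(mean_probabilities, axis=1))  # [0 1]; the tie in the first row goes to class 0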

4. Results

balanced accuracy score = 0.9616
accuracy score = 0.9619


5. Code

Below is the code that has not already been shown in the sections above.

import os
import numpy as np
import pandas as pd
import optuna
from optuna.samplers import TPESampler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from pathlib import Path
import sys

# *******************************************************************

if __name__ == '__main__':    
    
    ml_algorithm_name = 'sklearn'
    file_name_stub = 'optuna_' + ml_algorithm_name 
    
    calculation_type = 'production' #'calibration' 'production'
    
    base_directory = 'YOUR DIRECTORY/'  # replace with your own base directory
    data_directory = base_directory + 'data/'
        
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
               
    # fixed parameters
    save_models = False
    threshold_error = 0.93
    number_of_models = 25
    error_type = 'balanced_accuracy'
    optimizer_direction = 'maximize'
    cross_valid_folds = 3
    number_of_random_points = 25  # random searches to start opt process
    maximum_time = 60*60  # seconds
       
    # use small data set for illustration purposes
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
               
    print('\n*** starting at',pd.Timestamp.now())
    start_time_total = time.time()

    # calibration 
    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
                
        objective = Objective(x_calib, y_calib, error_type, cross_valid_folds)
    
        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(direction=optimizer_direction,
                sampler=TPESampler(n_startup_trials=number_of_random_points))
        
        study.optimize(objective, timeout=maximum_time)
        
        # save results
        df_results = study.trials_dataframe()
        df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
        df_results.to_csv(results_directory + 'df_optuna_results.csv')
        
        
        elapsed_time_total = (time.time()-start_time_total)/60
        print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
                          
    elif calculation_type == 'production':
        
        # get optuna results parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_optuna_results.pkl')
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                
        make_final_predictions(x_calib, y_calib, x_prod, y_prod, 
                               class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, file_name_stub,
                               number_of_models, optimizer_direction)
             
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError