1. Introduction
2. Optuna References
3. Using Optuna With scikit-learn
4. Results
5. Code

1. Introduction

Optuna is a Python package for general function optimization. It also includes integrations with many popular machine learning packages that allow the use of pruning algorithms, making hyperparameter searches more efficient. In this article we use Optuna to optimize hyperparameters for scikit-learn machine learning algorithms.

2. Optuna References

Preferred Networks created Optuna for internal use and later released it as open-source software. We hope this implies long-term support for the package.

3. Using Optuna With scikit-learn

We demonstrate how to use Optuna with scikit-learn by example. First, we create the following class:

class Objective(object):
    def __init__(self, xcalib, ycalib, error_type, cv_folds):
        self.xcalib = xcalib
        self.ycalib = ycalib
        self.error_type = error_type
        self.cv_folds = cv_folds

    def __call__(self, trial):
        # candidate values for the number of trees (n_estimators)
        list_trees = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250]

        # pick which algorithm this trial will evaluate
        classifier_name = trial.suggest_categorical('classifier', ['etrees', 'randomforest'])
        if classifier_name == 'etrees':
            
            et_n_estimators = trial.suggest_categorical('et_n_estimators', list_trees)
            et_max_features = trial.suggest_uniform('et_max_features', 0.15, 1.0)
            et_min_samples_split = trial.suggest_int('et_min_samples_split', 2, 14)
            et_min_samples_leaf = trial.suggest_int('et_min_samples_leaf', 1, 14)
            et_max_samples = trial.suggest_uniform('et_max_samples', 0.6, 0.99)
                        
            classifier_obj = ExtraTreesClassifier(n_estimators=et_n_estimators,
                             max_features=et_max_features, min_samples_split=et_min_samples_split,
                             min_samples_leaf=et_min_samples_leaf, max_samples=et_max_samples,
                             bootstrap=True, n_jobs=-1, verbose=0)
        else:    
            rf_n_estimators = trial.suggest_categorical('rf_n_estimators', list_trees)
            rf_max_features = trial.suggest_uniform('rf_max_features', 0.15, 1.0)
            rf_min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 14)
            rf_min_samples_leaf = trial.suggest_int('rf_min_samples_leaf', 1, 14)
            rf_max_samples = trial.suggest_uniform('rf_max_samples', 0.6, 0.99)
            
            classifier_obj = RandomForestClassifier(n_estimators=rf_n_estimators,
                             max_features=rf_max_features, min_samples_split=rf_min_samples_split,
                             min_samples_leaf=rf_min_samples_leaf, max_samples=rf_max_samples,
                             bootstrap=True, n_jobs=-1, verbose=0)
                    
        mean_cv_score = cross_val_score(classifier_obj, self.xcalib, self.ycalib, 
                                        scoring=self.error_type, 
                                        cv=self.cv_folds, n_jobs=-1).mean()

        return mean_cv_score

Optuna calls a specific set of hyperparameters together with the subsequent function evaluation a trial; a set of trials is called a study (see below). Here we specify ranges of hyperparameters for the extra trees (extremely randomized trees) and random forest classification algorithms. The trial.suggest_* functions are largely self-explanatory; all of the available options are listed in the Optuna documentation. One quirk is that we use trial.suggest_categorical with a list of integers to specify the possible numbers of trees, because Optuna does not offer a low/high/step suggestion for integers the way it does for floats (trial.suggest_discrete_uniform).
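
To make this quirk concrete, here is a minimal sketch of the two kinds of calls. It assumes the same Optuna version used in this article (where suggest_discrete_uniform is available); optuna.trial.FixedTrial is used only so the snippet runs outside a study, and the parameter names and values are made up for illustration.

import optuna

def demo(trial):
    # integers supplied as a categorical list of choices (the quirk described above)
    n_estimators = trial.suggest_categorical('n_estimators', [25, 50, 75, 100])
    # floats can use a low/high/step grid directly
    max_features = trial.suggest_discrete_uniform('max_features', 0.1, 1.0, 0.05)
    return n_estimators, max_features

# FixedTrial exercises the suggest calls with fixed values, without running a study
print(demo(optuna.trial.FixedTrial({'n_estimators': 50, 'max_features': 0.5})))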

Note our use of ‘et_’ and ‘rf_’ as prefixes on the names of the parameters passed to trial.suggest_categorical, trial.suggest_uniform, etc. Optuna lets us save all of the parameter values used in a study to a Pandas DataFrame, and these prefixes make it easy to see which parameters belong to which algorithm.

In the main function, we have:

error_type = 'balanced_accuracy'
optimizer_direction = 'maximize'
cross_valid_folds = 3
number_of_random_points = 25
maximum_time = 60*60  # seconds

objective = Objective(x_calib, y_calib, error_type, cross_valid_folds)

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction=optimizer_direction,
        sampler=TPESampler(n_startup_trials=number_of_random_points))

study.optimize(objective, timeout=maximum_time)

# save results
df_results = study.trials_dataframe()
df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
df_results.to_csv(results_directory + 'df_optuna_results.csv')

Optuna can be set to minimize or maximize the evaluation function.

sampler specifies the search algorithm to be used; we chose TPE (Tree-structured Parzen Estimator). Setting number_of_random_points = 25 means that the first 25 trials use random search to seed TPE.
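
As an aside, the TPE sampler can also be seeded to make the search reproducible; a minimal sketch follows (the seed argument is an addition for illustration and is not used in this article's runs).

import optuna
from optuna.samplers import TPESampler

# first 25 trials are random search; seed fixes the sampler's random state
sampler = TPESampler(n_startup_trials=25, seed=0)
study = optuna.create_study(direction='maximize', sampler=sampler)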

In study.optimize() we specify the run time limit in seconds via timeout. Alternatively, we can set n_trials to specify the total number of trials (sets of hyperparameters) to evaluate. For all options of this function, see the Optuna documentation for study.optimize().
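
For example, assuming the study and objective defined above, either stopping criterion (or both) can be passed; the trial count of 200 below is arbitrary.

# stop after one hour of wall-clock time (what we do above)
study.optimize(objective, timeout=60*60)

# or stop after 200 trials; if both limits are given, whichever is reached first stops the study
# study.optimize(objective, n_trials=200, timeout=60*60)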

study.trials_dataframe() returns a Pandas DataFrame containing all hyperparameter values and function evaluations. The DataFrame is shown below. We sorted it by balanced accuracy (the column labeled value), kept only the top 25 rows, and deleted the columns with start and end times. The column named ‘state’ refers to pruning (stopping an evaluation early), which we did not use here but will demonstrate in a later article; COMPLETE means that the trial was not pruned.
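
The sorting and trimming described above are ordinary Pandas operations; a minimal sketch follows. The datetime_start, datetime_complete, and duration column names are those produced by trials_dataframe in the Optuna version used here, and errors='ignore' guards against naming differences across versions.

df_results = study.trials_dataframe()
df_results = df_results.sort_values(by='value', ascending=False).head(25)
df_results = df_results.drop(columns=['datetime_start', 'datetime_complete', 'duration'], errors='ignore')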

Optuna Results DataFrame

(All of the top 25 trials used the extra trees classifier, so the empty params_rf_* columns are omitted, as are the duplicated index and the redundant system_attrs__number column; the params_ prefix is dropped from the remaining column names for readability.)

number  value   classifier  et_max_features  et_max_samples  et_min_samples_leaf  et_min_samples_split  et_n_estimators  state
278     0.9529  etrees      0.1504           0.9894          1                    2                     250              COMPLETE
341     0.9528  etrees      0.1942           0.9897          1                    2                     250              COMPLETE
207     0.9526  etrees      0.1663           0.9616          1                    3                     250              COMPLETE
378     0.9526  etrees      0.1714           0.9807          1                    2                     250              COMPLETE
388     0.9526  etrees      0.1525           0.9794          1                    2                     250              COMPLETE
394     0.9526  etrees      0.1522           0.956            1                    2                     250              COMPLETE
183     0.9525  etrees      0.1692           0.9684          1                    2                     175              COMPLETE
305     0.9524  etrees      0.1531           0.9537          1                    2                     225              COMPLETE
318     0.9524  etrees      0.2026           0.9716          1                    2                     250              COMPLETE
192     0.9522  etrees      0.1671           0.9739          1                    2                     175              COMPLETE
286     0.9522  etrees      0.1506           0.9768          1                    2                     250              COMPLETE
326     0.9522  etrees      0.2176           0.969            1                    2                     250              COMPLETE
345     0.9522  etrees      0.267             0.9887          1                    2                     250              COMPLETE
380     0.9522  etrees      0.1508           0.9817          1                    2                     250              COMPLETE
427     0.9522  etrees      0.1502           0.972            1                    2                     250              COMPLETE
428     0.9522  etrees      0.1513           0.9757          1                    2                     250              COMPLETE
444     0.9522  etrees      0.1502           0.9487          1                    2                     250              COMPLETE
213     0.9521  etrees      0.2021           0.9744          1                    2                     250              COMPLETE
432     0.9521  etrees      0.1501           0.9811          1                    2                     225              COMPLETE
225     0.952   etrees      0.225             0.9823          1                    2                     250              COMPLETE
302     0.952   etrees      0.1628           0.9527          1                    2                     250              COMPLETE
309     0.952   etrees      0.1642           0.9581          1                    2                     225              COMPLETE
172     0.9519  etrees      0.1694           0.9686          1                    2                     175              COMPLETE
194     0.9519  etrees      0.1649           0.958            1                    3                     250              COMPLETE
222     0.9519  etrees      0.22               0.9771          1                    2                     250              COMPLETE

Below is the function that reads in the saved hyperparameters, fits models on all of the calibration data (we used half of the MNIST handwritten digits data set to reduce computation time; cross validation was only used in the Objective class), and generates predictions on production (unseen) data. Note that we sort the DataFrame, apply a minimum balanced accuracy threshold (0.93), and cap the number of accepted models (25). The models are ensembled by soft voting: we compute each model's class probabilities, average them, and take the class with the highest mean probability as the final prediction (a small numeric illustration of this step appears after the function). If you only want results for the single best set of hyperparameters, sort the DataFrame and take the values from its first row.

def make_final_predictions(xcalib, ycalib, xprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, ml_name, num_models_accept, optimization_direction):
    
    if optimization_direction == 'maximize':
        df_params.sort_values(by='value', ascending=False, inplace=True)
    else:
        df_params.sort_values(by='value', ascending=True, inplace=True)
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if optimization_direction == 'maximize':
            bool1 = df_params.loc[df_params.index[i],'value'] > threshold
        else:
            bool1 = df_params.loc[df_params.index[i],'value'] < threshold
                    
        bool2 = df_params.loc[df_params.index[i],'state'] == 'COMPLETE'
        bool3 = accepted_models_num < num_models_accept
        if bool1 and bool2 and bool3:
            model_name = df_params.loc[df_params.index[i],'params_classifier']
            
            if model_name == 'etrees':
                ml_model = ExtraTreesClassifier(
                n_estimators=int(df_params.loc[df_params.index[i],'params_et_n_estimators']),
                max_features=df_params.loc[df_params.index[i],'params_et_max_features'], 
                min_samples_split=int(df_params.loc[df_params.index[i],'params_et_min_samples_split']),
                min_samples_leaf=int(df_params.loc[df_params.index[i],'params_et_min_samples_leaf']),
                max_samples=df_params.loc[df_params.index[i],'params_et_max_samples'],
                bootstrap=True, n_jobs=-1, verbose=0)
                
            elif model_name == 'randomforest':
                ml_model = RandomForestClassifier(
                n_estimators=int(df_params.loc[df_params.index[i],'params_rf_n_estimators']),
                max_features=df_params.loc[df_params.index[i],'params_rf_max_features'], 
                min_samples_split=int(df_params.loc[df_params.index[i],'params_rf_min_samples_split']),
                min_samples_leaf=int(df_params.loc[df_params.index[i],'params_rf_min_samples_leaf']),
                max_samples=df_params.loc[df_params.index[i],'params_rf_max_samples'],
                bootstrap=True, n_jobs=-1, verbose=0)
                
            else:
                print('\ncannot get correct model_name:',model_name)
                raise NameError
            
            ml_model.fit(xcalib, ycalib)
            
            list_predicted_prob.append(ml_model.predict_proba(xprod))
            
            accepted_models_num = accepted_models_num + 1
            
            if save_models_flag:
                number_string = str(df_params.loc[df_params.index[i],'number'])
                model_name = model_name + '_' + number_string + '_joblib.sav'
                dump(ml_model, save_directory + model_name)

    # compute mean probabilities across all accepted models
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses the 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)
    
    # print scores on the production data (these are the values reported in the Results section)
    print('\nbalanced accuracy score =', balanced_accuracy_score(yprod, y_predicted_class))
    print('accuracy score =', accuracy_score(yprod, y_predicted_class))
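
As a small numeric illustration of the averaging step above, here are made-up probabilities from two hypothetical models over two samples and two classes.

import numpy as np

probs_model_a = np.array([[0.6, 0.4], [0.2, 0.8]])
probs_model_b = np.array([[0.4, 0.6], [0.3, 0.7]])

# average the per-model class probabilities, then take the most probable class
mean_probabilities = np.mean([probs_model_a, probs_model_b], axis=0)  # [[0.5, 0.5], [0.25, 0.75]]
print(np.argmax(mean_probabilities, axis=1))  # [0 1]; the tie in the first row goes to class 0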

4. Results

balanced accuracy score = 0.9616
accuracy score = 0.9619


5. Code

Below is the code that has not already been shown in the sections above.

import os
import numpy as np
import pandas as pd
import optuna
from optuna.samplers import TPESampler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from pathlib import Path
import sys

# *******************************************************************

if __name__ == '__main__':    
    
    ml_algorithm_name = 'sklearn'
    file_name_stub = 'optuna_' + ml_algorithm_name 
    
    calculation_type = 'production' #'calibration' 'production'
    
    base_directory = 'YOUR DIRECTORY/'  # replace with your own base directory
    data_directory = base_directory + 'data/'
        
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
               
    # fixed parameters
    save_models = False
    threshold_error = 0.93
    number_of_models = 25
    error_type = 'balanced_accuracy'
    optimizer_direction = 'maximize'
    cross_valid_folds = 3
    number_of_random_points = 25  # random searches to start opt process
    maximum_time = 60*60  # seconds
       
    # use small data set for illustration purposes
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
               
    print('\n*** starting at',pd.Timestamp.now())
    start_time_total = time.time()

    # calibration 
    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
                
        objective = Objective(x_calib, y_calib, error_type, cross_valid_folds)
    
        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(direction=optimizer_direction,
                sampler=TPESampler(n_startup_trials=number_of_random_points))
        
        study.optimize(objective, timeout=maximum_time)
        
        # save results
        df_results = study.trials_dataframe()
        df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
        df_results.to_csv(results_directory + 'df_optuna_results.csv')
        
        
        elapsed_time_total = (time.time()-start_time_total)/60
        print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
                          
    elif calculation_type == 'production':
        
        # get optuna results parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_optuna_results.pkl')
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                
        make_final_predictions(x_calib, y_calib, x_prod, y_prod, 
                               class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, file_name_stub,
                               number_of_models, optimizer_direction)
             
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError