1. Introduction
Optuna is a Python package for general function optimization. It also includes integrations with many popular machine learning packages that allow the use of pruning algorithms to make hyperparameter searches more efficient. In this article we use Optuna to optimize hyperparameters for scikit-learn machine learning algorithms.
2. Optuna References
- Optuna homepage
- Preferred Networks homepage
- General Documentation
- Samplers (search algorithms)
- Code
- scikit-learn example
- Paper – Optuna: A Next-generation Hyperparameter Optimization Framework
Preferred Networks created Optuna for internal use and then released it as open source software, which we hope implies long-term support for the package.
3. Using Optuna With scikit-learn
We demonstrate how to use Optuna with scikit-learn by example. First, we create the following class:
```python
class Objective(object):
    def __init__(self, xcalib, ycalib, error_type, cv_folds):
        self.xcalib = xcalib
        self.ycalib = ycalib
        self.error_type = error_type
        self.cv_folds = cv_folds

    def __call__(self, trial):
        list_trees = [25, 50, 75, 100, 125, 150, 175, 200, 225, 250]

        # each trial picks one of the two algorithms, then draws its hyperparameters
        classifier_name = trial.suggest_categorical('classifier', ['etrees', 'randomforest'])
        if classifier_name == 'etrees':
            et_n_estimators = trial.suggest_categorical('et_n_estimators', list_trees)
            et_max_features = trial.suggest_uniform('et_max_features', 0.15, 1.0)
            et_min_samples_split = trial.suggest_int('et_min_samples_split', 2, 14)
            et_min_samples_leaf = trial.suggest_int('et_min_samples_leaf', 1, 14)
            et_max_samples = trial.suggest_uniform('et_max_samples', 0.6, 0.99)

            classifier_obj = ExtraTreesClassifier(
                n_estimators=et_n_estimators,
                max_features=et_max_features,
                min_samples_split=et_min_samples_split,
                min_samples_leaf=et_min_samples_leaf,
                max_samples=et_max_samples,
                bootstrap=True, n_jobs=-1, verbose=0)
        else:
            rf_n_estimators = trial.suggest_categorical('rf_n_estimators', list_trees)
            rf_max_features = trial.suggest_uniform('rf_max_features', 0.15, 1.0)
            rf_min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 14)
            rf_min_samples_leaf = trial.suggest_int('rf_min_samples_leaf', 1, 14)
            rf_max_samples = trial.suggest_uniform('rf_max_samples', 0.6, 0.99)

            classifier_obj = RandomForestClassifier(
                n_estimators=rf_n_estimators,
                max_features=rf_max_features,
                min_samples_split=rf_min_samples_split,
                min_samples_leaf=rf_min_samples_leaf,
                max_samples=rf_max_samples,
                bootstrap=True, n_jobs=-1, verbose=0)

        # mean cross validation score is the value that Optuna optimizes
        mean_cv_score = cross_val_score(classifier_obj, self.xcalib, self.ycalib,
                                        scoring=self.error_type, cv=self.cv_folds,
                                        n_jobs=-1).mean()

        return mean_cv_score
```
Optuna calls a specific set of hyperparameters and the subsequent function evaluation a trial. A set of trials is called a study (see below). Here we specify ranges of hyperparameters for the extra (extremely randomized) trees and random forest classification algorithms. The trial.suggest_* functions are largely self explanatory; all such options can be found here. One quirk is that we use trial.suggest_categorical with a list of integers to specify the possible number of trees, because Optuna does not support a (low, high, step) specification for integers as it does for floats (trial.suggest_discrete_uniform).
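For example, a single trial's parameter draws reduce to calls like the minimal sketch below (the trial object is supplied by Optuna; the 0.01 step for suggest_discrete_uniform is illustrative, our code above uses suggest_uniform instead):

```python
# minimal sketch of the suggest API used above; Optuna passes in 'trial'
n_estimators = trial.suggest_categorical('et_n_estimators', [25, 50, 75, 100])   # integers via a category list
max_features = trial.suggest_uniform('et_max_features', 0.15, 1.0)               # continuous float
max_samples = trial.suggest_discrete_uniform('et_max_samples', 0.6, 0.99, 0.01)  # float with low, high, step
min_samples_leaf = trial.suggest_int('et_min_samples_leaf', 1, 14)               # integer range, no step
```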
Note our use of 'et_' and 'rf_' as prefixes in the names of the parameters passed to trial.suggest_categorical, trial.suggest_uniform, etc. Optuna can save all actual parameter values used in a study into a Pandas DataFrame, and these prefixes make it easier to know which parameters were used with which algorithm.
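For instance, once a study has been run (see below), the prefixes let us isolate each algorithm's parameters in the results DataFrame; a hypothetical sketch:

```python
# hypothetical: isolate the extra trees rows and their parameter columns
df_results = study.trials_dataframe()
et_columns = [col for col in df_results.columns if col.startswith('params_et_')]
df_etrees = df_results.loc[df_results['params_classifier'] == 'etrees', et_columns]
```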
In the main function, we have:
```python
error_type = 'balanced_accuracy'
optimizer_direction = 'maximize'
cross_valid_folds = 3
number_of_random_points = 25
maximum_time = 60*60  # seconds

objective = Objective(x_calib, y_calib, error_type, cross_valid_folds)

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction=optimizer_direction,
                            sampler=TPESampler(n_startup_trials=number_of_random_points))
study.optimize(objective, timeout=maximum_time)

# save results
df_results = study.trials_dataframe()
df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
df_results.to_csv(results_directory + 'df_optuna_results.csv')
```
Optuna can be set to minimize or maximize the evaluation function.
sampler specifies the search algorithm to be used. We chose TPE (Tree-structured Parzen Estimator), with number_of_random_points = 25 so that 25 random trials are run first to prime TPE.
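Putting these together, study creation looks like the sketch below; the seed argument is an optional addition (not used in our code) that makes the sampler reproducible:

```python
from optuna.samplers import TPESampler

# 25 random startup trials prime the TPE sampler; seed is optional
sampler = TPESampler(n_startup_trials=25, seed=42)
study = optuna.create_study(direction='maximize', sampler=sampler)
```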
In study.optimize() we specified the run time in seconds. Alternatively, we can set n_trials to specify the total number of trials (sets of hyperparameters). For all options for this function, see here, beginning at line 254.
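For example, a trial-count limit, or both limits at once, can be sketched as:

```python
# stop after 500 trials instead of after a fixed time
study.optimize(objective, n_trials=500)

# or supply both; optimization stops at whichever limit is hit first
study.optimize(objective, n_trials=500, timeout=60*60)
```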
study.trials_dataframe() returns a Pandas DataFrame with all hyperparameter and function evaluation values. The DataFrame is shown below. We sorted it by balanced accuracy (the column labeled value), kept only the top 25 rows, and deleted the columns with start and end times. The column labeled 'state' refers to pruning (stopping an evaluation early), which we did not use here but will demonstrate in a later article. COMPLETE means that the trial was not pruned.
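That post-processing can be sketched as follows (the timing column names may vary across Optuna versions):

```python
# sort by the objective value, keep the 25 best trials, drop timing columns
df_top = (df_results.sort_values(by='value', ascending=False)
                    .head(25)
                    .drop(columns=['datetime_start', 'datetime_complete']))
```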
Optuna Results DataFrame
| | number | value | params_classifier | params_et_max_features | params_et_max_samples | params_et_min_samples_leaf | params_et_min_samples_split | params_et_n_estimators | params_rf_max_features | params_rf_max_samples | params_rf_min_samples_leaf | params_rf_min_samples_split | params_rf_n_estimators | system_attrs__number | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 278 | 278 | 0.9529 | etrees | 0.1504 | 0.9894 | 1 | 2 | 250 | | | | | | 278 | COMPLETE |
| 341 | 341 | 0.9528 | etrees | 0.1942 | 0.9897 | 1 | 2 | 250 | | | | | | 341 | COMPLETE |
| 207 | 207 | 0.9526 | etrees | 0.1663 | 0.9616 | 1 | 3 | 250 | | | | | | 207 | COMPLETE |
| 378 | 378 | 0.9526 | etrees | 0.1714 | 0.9807 | 1 | 2 | 250 | | | | | | 378 | COMPLETE |
| 388 | 388 | 0.9526 | etrees | 0.1525 | 0.9794 | 1 | 2 | 250 | | | | | | 388 | COMPLETE |
| 394 | 394 | 0.9526 | etrees | 0.1522 | 0.956 | 1 | 2 | 250 | | | | | | 394 | COMPLETE |
| 183 | 183 | 0.9525 | etrees | 0.1692 | 0.9684 | 1 | 2 | 175 | | | | | | 183 | COMPLETE |
| 305 | 305 | 0.9524 | etrees | 0.1531 | 0.9537 | 1 | 2 | 225 | | | | | | 305 | COMPLETE |
| 318 | 318 | 0.9524 | etrees | 0.2026 | 0.9716 | 1 | 2 | 250 | | | | | | 318 | COMPLETE |
| 192 | 192 | 0.9522 | etrees | 0.1671 | 0.9739 | 1 | 2 | 175 | | | | | | 192 | COMPLETE |
| 286 | 286 | 0.9522 | etrees | 0.1506 | 0.9768 | 1 | 2 | 250 | | | | | | 286 | COMPLETE |
| 326 | 326 | 0.9522 | etrees | 0.2176 | 0.969 | 1 | 2 | 250 | | | | | | 326 | COMPLETE |
| 345 | 345 | 0.9522 | etrees | 0.267 | 0.9887 | 1 | 2 | 250 | | | | | | 345 | COMPLETE |
| 380 | 380 | 0.9522 | etrees | 0.1508 | 0.9817 | 1 | 2 | 250 | | | | | | 380 | COMPLETE |
| 427 | 427 | 0.9522 | etrees | 0.1502 | 0.972 | 1 | 2 | 250 | | | | | | 427 | COMPLETE |
| 428 | 428 | 0.9522 | etrees | 0.1513 | 0.9757 | 1 | 2 | 250 | | | | | | 428 | COMPLETE |
| 444 | 444 | 0.9522 | etrees | 0.1502 | 0.9487 | 1 | 2 | 250 | | | | | | 444 | COMPLETE |
| 213 | 213 | 0.9521 | etrees | 0.2021 | 0.9744 | 1 | 2 | 250 | | | | | | 213 | COMPLETE |
| 432 | 432 | 0.9521 | etrees | 0.1501 | 0.9811 | 1 | 2 | 225 | | | | | | 432 | COMPLETE |
| 225 | 225 | 0.952 | etrees | 0.225 | 0.9823 | 1 | 2 | 250 | | | | | | 225 | COMPLETE |
| 302 | 302 | 0.952 | etrees | 0.1628 | 0.9527 | 1 | 2 | 250 | | | | | | 302 | COMPLETE |
| 309 | 309 | 0.952 | etrees | 0.1642 | 0.9581 | 1 | 2 | 225 | | | | | | 309 | COMPLETE |
| 172 | 172 | 0.9519 | etrees | 0.1694 | 0.9686 | 1 | 2 | 175 | | | | | | 172 | COMPLETE |
| 194 | 194 | 0.9519 | etrees | 0.1649 | 0.958 | 1 | 3 | 250 | | | | | | 194 | COMPLETE |
| 222 | 222 | 0.9519 | etrees | 0.22 | 0.9771 | 1 | 2 | 250 | | | | | | 222 | COMPLETE |
Below is the function that reads in the saved hyperparameters, fits models on all calibration data (we used half of the MNIST handwritten digits data set to reduce computation time; cross validation was used in the Objective class), and generates predictions on production (unseen) data. Note that we sort the DataFrame, apply a minimum balanced accuracy threshold (0.93), and cap the number of accepted models (25). Ensembling is implemented by computing class probabilities, averaging them, and using plurality voting to obtain final class values. If you only want results for the single best set of hyperparameters, you can sort the DataFrame and take the values from the first row.
```python
def make_final_predictions(xcalib, ycalib, xprod, yprod, list_class_names,
                           models_directory, save_directory, save_models_flag,
                           df_params, threshold, ml_name, num_models_accept,
                           optimization_direction):

    # best trials first
    if optimization_direction == 'maximize':
        df_params.sort_values(by='value', ascending=False, inplace=True)
    else:
        df_params.sort_values(by='value', ascending=True, inplace=True)

    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if optimization_direction == 'maximize':
            bool1 = df_params.loc[df_params.index[i], 'value'] > threshold
        else:
            bool1 = df_params.loc[df_params.index[i], 'value'] < threshold

        bool2 = df_params.loc[df_params.index[i], 'state'] == 'COMPLETE'
        bool3 = accepted_models_num < num_models_accept

        if bool1 and bool2 and bool3:
            model_name = df_params.loc[df_params.index[i], 'params_classifier']
            if model_name == 'etrees':
                ml_model = ExtraTreesClassifier(
                    n_estimators=int(df_params.loc[df_params.index[i], 'params_et_n_estimators']),
                    max_features=df_params.loc[df_params.index[i], 'params_et_max_features'],
                    min_samples_split=int(df_params.loc[df_params.index[i], 'params_et_min_samples_split']),
                    min_samples_leaf=int(df_params.loc[df_params.index[i], 'params_et_min_samples_leaf']),
                    max_samples=df_params.loc[df_params.index[i], 'params_et_max_samples'],
                    bootstrap=True, n_jobs=-1, verbose=0)
            elif model_name == 'randomforest':
                ml_model = RandomForestClassifier(
                    n_estimators=int(df_params.loc[df_params.index[i], 'params_rf_n_estimators']),
                    max_features=df_params.loc[df_params.index[i], 'params_rf_max_features'],
                    min_samples_split=int(df_params.loc[df_params.index[i], 'params_rf_min_samples_split']),
                    min_samples_leaf=int(df_params.loc[df_params.index[i], 'params_rf_min_samples_leaf']),
                    max_samples=df_params.loc[df_params.index[i], 'params_rf_max_samples'],
                    bootstrap=True, n_jobs=-1, verbose=0)
            else:
                print('\ncannot get correct model_name:', model_name)
                raise NameError

            ml_model.fit(xcalib, ycalib)
            list_predicted_prob.append(ml_model.predict_proba(xprod))
            accepted_models_num = accepted_models_num + 1

            if save_models_flag:
                number_string = str(df_params.loc[df_params.index[i], 'number'])
                model_name = model_name + '_' + number_string + '_joblib.sav'
                dump(ml_model, save_directory + model_name)

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)

    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)

    # score the ensemble on production data (reported in the Results section)
    print('\nbalanced accuracy score =', balanced_accuracy_score(yprod, y_predicted_class))
    print('accuracy score =', accuracy_score(yprod, y_predicted_class))
```
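The ensembling at the end of the function reduces to two NumPy calls; a toy example with two hypothetical models, three samples, and two classes:

```python
import numpy as np

# class probabilities from two hypothetical models
probs_model_1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])
probs_model_2 = np.array([[0.7, 0.3], [0.2, 0.8], [0.3, 0.7]])

mean_probabilities = np.mean([probs_model_1, probs_model_2], axis=0)  # average over models
y_predicted_class = np.argmax(mean_probabilities, axis=1)             # plurality vote
print(y_predicted_class)  # [0 1 1]
```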
4. Results
balanced accuracy score = 0.9616
accuracy score = 0.9619
5. Code
Below is the code that has not already been shown in the sections above.
```python
import os
import sys
import time
from pathlib import Path

import numpy as np
import pandas as pd
import optuna
from optuna.samplers import TPESampler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from joblib import dump

# *******************************************************************

if __name__ == '__main__':

    ml_algorithm_name = 'sklearn'
    file_name_stub = 'optuna_' + ml_algorithm_name

    calculation_type = 'production'  # 'calibration' 'production'

    base_directory = YOUR DIRECTORY  # set this to your base directory
    data_directory = base_directory + 'data/'

    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)

    # fixed parameters
    save_models = False
    threshold_error = 0.93
    number_of_models = 25
    error_type = 'balanced_accuracy'
    optimizer_direction = 'maximize'
    cross_valid_folds = 3
    number_of_random_points = 25  # random searches to start opt process
    maximum_time = 60*60  # seconds

    # use small data set for illustration purposes
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')

    print('\n*** starting at', pd.Timestamp.now())
    start_time_total = time.time()

    # calibration
    if calculation_type == 'calibration':
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        objective = Objective(x_calib, y_calib, error_type, cross_valid_folds)

        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(direction=optimizer_direction,
                                    sampler=TPESampler(n_startup_trials=number_of_random_points))
        study.optimize(objective, timeout=maximum_time)

        # save results
        df_results = study.trials_dataframe()
        df_results.to_pickle(results_directory + 'df_optuna_results.pkl')
        df_results.to_csv(results_directory + 'df_optuna_results.csv')

        elapsed_time_total = (time.time() - start_time_total)/60
        print('\n\ntotal elapsed time =', elapsed_time_total, ' minutes')

    # production
    elif calculation_type == 'production':
        # get optuna results parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_optuna_results.pkl')

        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')

        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))

        make_final_predictions(x_calib, y_calib, x_prod, y_prod,
                               class_names_list, models_dir,
                               results_directory, save_models,
                               df_parameters, threshold_error,
                               file_name_stub, number_of_models,
                               optimizer_direction)

    else:
        print('\ninvalid calculation type:', calculation_type)
        raise NameError
```