1. Bayesian Optimization Python Package
  2. Using Bayesian Optimization
  3. Ensembling
  4. Results
  5. Remarks
  6. Code

1. Bayesian Optimization Python Package

Bayesian Optimization (BO) is a lightweight Python package for finding the parameters of an arbitrary function that maximize a given objective function. It is a constrained global optimization package built upon Bayesian inference and Gaussian processes that attempts to find the maximum value of an unknown function in as few iterations as possible. This technique is particularly suited to the optimization of expensive-to-evaluate functions, situations in which the balance between exploration and exploitation is important.

In this article, we demonstrate how to use this package to do hyperparameter search for a classification problem with Scikit-learn.
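The package is published on PyPI as bayesian-optimization (installable with pip install bayesian-optimization) and is imported as bayes_opt, as in the code below.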

2. Using Bayesian Optimization

Below is a code fragment showing how to integrate the package with Scikit-learn:

def etrees_crossval(n_estimators, max_features, min_samples_split, 
                    min_samples_leaf, max_samples):
    etrees = ExtraTreesClassifier(n_estimators=int(n_estimators),
             max_features=max_features, min_samples_split=int(min_samples_split),
             min_samples_leaf=int(min_samples_leaf), max_samples=max_samples,
             bootstrap=True, n_jobs=-1, verbose=0)
    
    mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric, 
                                    cv=cv_folds, n_jobs=-1).mean()
    
    return mean_cv_score

optimizer = BayesianOptimization(f=etrees_crossval,
                                 pbounds={'n_estimators': (25, 251),
                                          'max_features': (0.15, 1.0),
                                          'min_samples_split': (2, 14),
                                          'min_samples_leaf': (1, 14),
                                          'max_samples': (0.6, 0.99)},
                                 verbose=2)

optimizer.maximize(init_points=num_random_points, 
                   n_iter=num_iterations)

print('\nbest result:', optimizer.max)

NOTE: BO maximizes the given function. Thus, if you are attempting to minimize a metric such as log loss, multiply it by -1, subtract it from 1, or otherwise transform it so that larger returned values are better.
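
For example, here is a minimal sketch (our illustration, not part of the original script; it reuses x, y, and cv_folds from the fragment above) of an objective built around a loss that must be minimized:

def etrees_crossval_logloss(n_estimators, max_features):
    # hypothetical objective for a metric that must be minimized
    etrees = ExtraTreesClassifier(n_estimators=int(n_estimators),
             max_features=max_features, bootstrap=True, n_jobs=-1, verbose=0)
    
    # Scikit-learn's 'neg_log_loss' scorer is already negated, so maximizing it
    # minimizes the log loss; with a raw loss value, return -loss instead
    return cross_val_score(etrees, x, y, scoring='neg_log_loss',
                           cv=cv_folds, n_jobs=-1).mean()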

BO requires that the user specify real-valued, inclusive bounds for each parameter. As there is no built-in support for integer or categorical values, the user must force a conversion, as was done for n_estimators (the number of decision trees). Note also that we used 0.99 as the maximum value for max_samples because the ExtraTreesClassifier documentation requires this value to be strictly less than 1. Otherwise, we only have to be careful to use the exact parameter names that appear in the Scikit-learn function call.
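
Here is a minimal sketch (our illustration, not from the article) of how a categorical parameter such as criterion could be exposed to BO, by letting BO propose a float that is then rounded to an index into a list of choices:

criterion_choices = ['gini', 'entropy']   # hypothetical categorical choices

def etrees_crossval_with_criterion(n_estimators, criterion_index):
    etrees = ExtraTreesClassifier(
             n_estimators=int(n_estimators),                            # float -> int
             criterion=criterion_choices[int(round(criterion_index))],  # float -> category
             bootstrap=True, n_jobs=-1, verbose=0)
    
    return cross_val_score(etrees, x, y, scoring=error_metric,
                           cv=cv_folds, n_jobs=-1).mean()

# pbounds would then include 'criterion_index': (0, len(criterion_choices) - 1)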

init_points is the number of sets of hyperparameters generated by random search at the start of the run, and n_iter is the number subsequently generated via Bayesian inference and Gaussian processes; the total number of sets evaluated is their sum. BO attempts to balance exploration (random search) and exploitation (guided search). These are controlled via parameters that can be set by the user; we only used the default values. See Section 3 in the Advanced tour of the Bayesian Optimization package for how to access these parameters.
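
For reference, here is a minimal sketch of working with the acquisition function directly, assuming a 1.x release of the package in which UtilityFunction, suggest, and register are available (kappa and xi control the exploration/exploitation trade-off):

from bayes_opt import UtilityFunction

# larger kappa favors exploration, smaller kappa favors exploitation
utility = UtilityFunction(kind='ucb', kappa=2.576, xi=0.0)

next_params = optimizer.suggest(utility)                 # propose a set of hyperparameters
target = etrees_crossval(**next_params)                  # evaluate it
optimizer.register(params=next_params, target=target)    # feed the result back to BO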

Setting verbose=2 will print each set of parameters to the screen during run time, showing the best result in purple.

To collect all results from BO, we have:

list_dfs = []
counter = 0
for result in optimizer.res:
    df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                     columns=['trial' + str(counter)]).T
    df_temp[error_metric] = result['target']
    
    list_dfs.append(df_temp)
    
    counter = counter + 1
    
df_results = pd.concat(list_dfs, axis=0)
df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')

optimizer.res is a list of dictionaries. The code above puts each dictionary into a DataFrame and then stacks them vertically into a single DataFrame. optimizer.max yields the best result (the maximum target); it can also be recovered from the combined DataFrame.
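
For example, the best trial can be read off the combined DataFrame as follows (a small sketch using the names defined above):

best_trial = df_results[error_metric].idxmax()   # index label of the best trial
print(df_results.loc[best_trial])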

3. Ensembling

In the objective function, we used cross validation to score each set of hyperparameters proposed by BO. To obtain an ensemble, we apply a threshold to the cross validation score (we used balanced accuracy) and, for each accepted set of hyperparameters, refit an ExtraTreesClassifier on the entire calibration data set. We then feed the production data (unseen data) into each fitted model to obtain class probabilities, NOT the predicted classes themselves. These probabilities are averaged across models, and the final predicted class is the one with the largest mean probability.

accepted_models_num = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    if df_params.loc[df_params.index[i],type_error] > threshold:
        ml_model = ExtraTreesClassifier(
        n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
        max_features=df_params.loc[df_params.index[i],'max_features'], 
        min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
        min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']), 
        max_samples=df_params.loc[df_params.index[i],'max_samples'],
        bootstrap=True, n_jobs=-1, verbose=0)
        
        ml_model.fit(xcalib, ycalib)
        
        list_predicted_prob.append(ml_model.predict_proba(xprod))
        
        accepted_models_num = accepted_models_num + 1
        
        if save_models_flag:
            model_name = ml_name + df_params.index[i] + '_joblib.sav'
            dump(ml_model, save_directory + model_name)

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses the 1st occurrence in case of a tie
y_predicted_class = np.argmax(mean_probabilities, axis=1)

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data; the remaining test data served as the production (unseen) data.
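
The article does not show how the .npy files were created. The sketch below (an assumption on our part, not the author's preprocessing code) illustrates how such a calibration/production split could be built, using file names that match those loaded in Section 6:

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# download MNIST; the first 60,000 samples are the usual training set
x_all, y_all = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y_all = y_all.astype(np.int64)

x_train, x_prod = x_all[:60000], x_all[60000:]
y_train, y_prod = y_all[:60000], y_all[60000:]

# keep half of the training data as calibration data, stratified by class
x_calib, _, y_calib, _ = train_test_split(x_train, y_train, train_size=0.5,
                                          stratify=y_train, random_state=0)

data_directory = 'YOUR DIRECTORY'   # same placeholder as in Section 6
np.save(data_directory + 'x_mnist_calibration_1.npy', x_calib)
np.save(data_directory + 'y_mnist_calibration_1.npy', y_calib)
np.save(data_directory + 'x_mnist_production.npy', x_prod)
np.save(data_directory + 'y_mnist_production.npy', y_prod)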

ExtraTreesClassifier Hyperparameters (Sorted by Balanced Accuracy)

max_features  max_samples  min_samples_leaf  min_samples_split  n_estimators  balanced_accuracy
0.1961        0.9269       1.1227            2.1161             207.9663      0.9523
0.2117        0.9777       1.1966            2.2298             235.0982      0.9520
0.1688        0.9753       1.1505            2.5158             196.8729      0.9516
0.1544        0.9599       1.0066            2.5087             198.8168      0.9515
0.1800        0.9519       1.0254            2.9930             214.4838      0.9515
0.1826        0.9756       1.1458            2.1729             232.7461      0.9515
0.2007        0.9245       1.1570            2.0052             208.9746      0.9515
0.1518        0.9882       1.3423            2.4815             198.9588      0.9513
0.1640        0.9809       1.2898            2.3159             229.9535      0.9512
0.2033        0.9547       1.1997            2.6057             228.9384      0.9512
0.1611        0.9851       1.5794            2.0533             192.6078      0.9512
0.1536        0.9663       1.4732            2.7149             230.4336      0.9510
0.1582        0.9735       1.4221            2.5452             137.7178      0.9509
0.1518        0.9129       1.0644            2.0604             176.7422      0.9507
0.2228        0.9887       1.3270            2.9310             225.7456      0.9505
0.1636        0.9834       1.0694            4.9818             230.1185      0.9504
0.1904        0.9614       1.1356            2.9671             182.6704      0.9504
0.2104        0.9878       1.0428            3.1741             217.8862      0.9504
0.3227        0.9134       1.0605            2.0237             228.7140      0.9504
0.1537        0.9877       1.0305            2.1220             158.5868      0.9504
0.1600        0.9370       1.0030            2.4038             236.6105      0.9504
0.1688        0.9540       1.0092            2.4244             121.8958      0.9504
0.1556        0.9740       1.6935            2.0528             138.1634      0.9503
0.2078        0.9821       1.0207            4.3096             219.1431      0.9502
0.2238        0.9666       1.0800            2.3109             217.9462      0.9502
0.1727        0.9306       1.1561            2.0065             236.3114      0.9501
0.2026        0.9391       1.1060            2.1230             189.2522      0.9501
0.2581        0.9571       1.0083            2.6091             220.0087      0.9501
0.2206        0.9792       1.0154            3.0401             158.7534      0.9499
0.1726        0.9311       1.0054            2.8431             170.7284      0.9498
0.2978        0.9786       1.1017            2.1668             160.0341      0.9495
0.2459        0.9592       1.0485            2.1825             207.4609      0.9492
0.2093        0.8186       1.2665            2.0807             154.5424      0.9486
0.2602        0.7388       1.0667            2.2214             150.7549      0.9472
0.5697        0.9470       1.2477            2.0063             137.2946      0.9470
0.3175        0.7180       1.7369            8.4912             232.3052      0.9433
0.7902        0.6108       2.6842            2.4465             193.7491      0.9376
0.5998        0.6992       1.0370            13.8980            250.3599      0.9371
0.9031        0.8760       1.0370            13.3955            217.8444      0.9365
0.6608        0.9348       4.4894            10.5830            236.3424      0.9365
0.8072        0.9697       1.1488            13.6221            78.7524       0.9354
0.5268        0.7206       4.0216            9.8794             126.5908      0.9338
0.5815        0.6418       4.8844            3.5956             98.5232       0.9302
0.2003        0.6104       1.9644            13.8727            25.0170       0.9294
0.7331        0.7944       1.0158            2.5685             25.6396       0.9263
0.1542        0.6322       8.1930            9.2273             218.2138      0.9254
0.4970        0.8521       10.9806           9.6078             237.1578      0.9207
0.7974        0.6009       8.9383            2.3354             246.9411      0.9178
0.2673        0.7177       12.7993           9.2520             174.0114      0.9159
0.3489        0.8850       11.4094           9.6054             44.2291       0.9142

balanced accuracy score = 0.9596
accuracy score = 0.9599
number of accepted models = 43 for threshold = 0.93


5. Remarks

Due to its ease of use, Bayesian Optimization can be considered a drop-in replacement for Scikit-learn's random hyperparameter search. It should produce better hyperparameters, and do so faster, than pure random search, while at worst it is equivalent to random search.

We also note that while it would be nice to have built-in support for integer and categorical parameters, the creators of BO have opted for a lightweight approach that makes the package easy to use and maintain. This latter aspect is important, as other optimization packages have fallen by the wayside when abandoned by their creators. BO depends only on NumPy, SciPy, and Scikit-learn. As these packages are usually upgraded in a way that does not break older code, BO is less reliant on active maintenance than other packages with more bells and whistles.

6. Code

import os
import numpy as np
import pandas as pd
from bayes_opt import BayesianOptimization
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from pathlib import Path
import sys
sys.path.append('/home/ubuntu/python/miscellaneous_library')
import confusion_matrix_plot
    
# *******************************************************************
    
def create_save_models_bayes_opt(x, y, error_metric, cv_folds,
    num_random_points, num_iterations, results_dir):
    
    start_time_total = time.time()
    
    def etrees_crossval(n_estimators, max_features, min_samples_split, 
                        min_samples_leaf, max_samples):
        etrees = ExtraTreesClassifier(n_estimators=int(n_estimators),
                 max_features=max_features, min_samples_split=int(min_samples_split),
                 min_samples_leaf=int(min_samples_leaf), max_samples=max_samples,
                 bootstrap=True, n_jobs=-1, verbose=0)
        
        mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric, 
                                        cv=cv_folds, n_jobs=-1).mean()
        
        return mean_cv_score
       
    optimizer = BayesianOptimization(f=etrees_crossval,
                                     pbounds={'n_estimators': (25, 251),
                                              'max_features': (0.15, 1.0),
                                              'min_samples_split': (2, 14),
                                              'min_samples_leaf': (1, 14),
                                              'max_samples': (0.6, 0.99)},
                                     verbose=2)
    
    optimizer.maximize(init_points=num_random_points, 
                       n_iter=num_iterations)
    
    print('\nbest result:', optimizer.max)
    
    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
    
    # optimizer.res is a list of dicts
    list_dfs = []
    counter = 0
    for result in optimizer.res:
        df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                         columns=['trial' + str(counter)]).T
        df_temp[error_metric] = result['target']
        
        list_dfs.append(df_temp)
        
        counter = counter + 1
        
    df_results = pd.concat(list_dfs, axis=0)
    df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
    df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')
        
# end of create_save_models_bayes_opt()
    
# *******************************************************************
            
def make_final_predictions(xcalib, ycalib, xprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, type_error, ml_name):
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if df_params.loc[df_params.index[i],type_error] > threshold:
            ml_model = ExtraTreesClassifier(
            n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
            max_features=df_params.loc[df_params.index[i],'max_features'], 
            min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
            min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']), 
            max_samples=df_params.loc[df_params.index[i],'max_samples'],
            bootstrap=True, n_jobs=-1, verbose=0)
            
            ml_model.fit(xcalib, ycalib)
            
            list_predicted_prob.append(ml_model.predict_proba(xprod))
            
            accepted_models_num = accepted_models_num + 1
            
            if save_models_flag:
                model_name = ml_name + df_params.index[i] + '_joblib.sav'
                dump(ml_model, save_directory + model_name)

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses the 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)
    
    # compute and save error measures

    # print info to file
    stdout_default = sys.stdout
    sys.stdout = open(save_directory + ml_name + '_prediction_results.txt','w')
    
    print('balanced accuracy score =',balanced_accuracy_score(yprod, y_predicted_class))
    
    print('accuracy score =',accuracy_score(yprod, y_predicted_class))
    
    print('number of accepted models =',accepted_models_num,' for threshold =',threshold)
    
    print('\nclassification report:')
    print(classification_report(yprod, y_predicted_class, digits=3, output_dict=False))
    
    print('\nraw confusion matrix:')
    cm_raw = confusion_matrix(yprod, y_predicted_class)
    print(cm_raw)
    
    print('\nconfusion matrix normalized by prediction:')
    cm_pred = confusion_matrix(yprod, y_predicted_class, normalize='pred')
    print(cm_pred)
    
    print('\nconfusion matrix normalized by true:')
    cm_true = confusion_matrix(yprod, y_predicted_class, normalize='true')
    print(cm_true)
    
    sys.stdout = stdout_default   
 
    # plot and save confusion matrices
    figure_size = (12, 8)
    number_of_decimals = 4
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_raw, 
    list_class_names, save_directory, 'Confusion Matrix', 
    ml_name + '_confusion_matrix', False, None, 30, figure_size,
    number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_pred, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Prediction', 
    ml_name + '_confusion_matrix_norm_by_prediction', False, 'pred', 
    30, figure_size, number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_true, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Actual', 
    ml_name + '_confusion_matrix_norm_by_true', False, 'true', 
    30, figure_size, number_of_decimals)

# end of make_final_predictions()
       
# *******************************************************************

if __name__ == '__main__':    
    
    ml_algorithm_name = 'etrees'
    file_name_stub = ml_algorithm_name + '_bayes_opt'  
    
    calculation_type = 'production' #'calibration' 'production'
    
    data_directory = 'YOUR DIRECTORY'   # directory containing the MNIST .npy files
    
    base_directory = 'YOUR DIRECTORY'   # directory where results will be written
    
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
                
    # fixed parameters
    error_type = 'balanced_accuracy'
    threshold_error = 0.93
    cross_valid_folds = 3
    total_number_of_iterations = 50
    number_of_random_points = 10  # random searches to start opt process
    # this is # of bayes iters, thus total=this + # of random pts
    number_of_iterations =  total_number_of_iterations - number_of_random_points
    save_models = False
           
    # use small data set
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
                
    print('\n*** starting at',pd.Timestamp.now())

    # 1 - calibration - using cross validation, get cv score for the given set of
    # parameters via bayes opt search
    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
        
        create_save_models_bayes_opt(x_calib, y_calib, error_type, 
                                     cross_valid_folds, 
                                     number_of_random_points, number_of_iterations,
                                     results_directory)

    # 2 - production - apply threshold
    elif calculation_type == 'production':
        
        # get etrees parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_bayes_opt_results_parameters.pkl')
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                
        make_final_predictions(x_calib, y_calib, x_prod, y_prod, 
                               class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, error_type, ml_algorithm_name)
               
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError