1. Bayesian Optimization Python Package
  2. Using Bayesian Optimization
  3. Ensembling
  4. Results
  5. Remarks
  6. Code

1. Bayesian Optimization Python Package

Bayesian Optimization (BO) is a lightweight Python package for finding the parameters of an arbitrary function that maximize a given objective function. It is a constrained global optimization package built upon Bayesian inference and Gaussian processes that attempts to find the maximum value of an unknown function in as few iterations as possible. This technique is particularly suited to the optimization of expensive-to-evaluate functions, situations in which the balance between exploration and exploitation is important.

In this article, we demonstrate how to use this package to do hyperparameter search for a classification problem with Scikit-learn.
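The package is published on PyPI as bayesian-optimization (installable with pip install bayesian-optimization) and is imported as bayes_opt, as in the code below.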

2. Using Bayesian Optimization

Below is a code fragment showing how to integrate the package with Scikit-learn:

def etrees_crossval(n_estimators, max_features, min_samples_split, 
                    min_samples_leaf, max_samples):
    etrees = ExtraTreesClassifier(n_estimators=int(n_estimators),
             max_features=max_features, min_samples_split=int(min_samples_split),
             min_samples_leaf=int(min_samples_leaf), max_samples=max_samples,
             bootstrap=True, n_jobs=-1, verbose=0)
    
    mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric, 
                                    cv=cv_folds, n_jobs=-1).mean()
    
    return mean_cv_score

optimizer = BayesianOptimization(f=etrees_crossval,
                                 pbounds={'n_estimators': (25, 251),
                                          'max_features': (0.15, 1.0),
                                          'min_samples_split': (2, 14),
                                          'min_samples_leaf': (1, 14),
                                          'max_samples': (0.6, 0.99)},
                                 verbose=2)

optimizer.maximize(init_points=num_random_points, 
                   n_iter=num_iterations)

print('\nbest result:', optimizer.max)

NOTE: BO maximizes the given function. Thus, if you are attempting to minimize a metric such as log loss, multiply it by -1, subtract it from 1, or otherwise transform it so that larger returned values are better.
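
For example, here is a minimal sketch (our illustration, not part of the original script; it reuses x, y, and cv_folds from the fragment above) of an objective built around a loss that must be minimized:

def etrees_crossval_logloss(n_estimators, max_features):
    # hypothetical objective for a metric that must be minimized
    etrees = ExtraTreesClassifier(n_estimators=int(n_estimators),
             max_features=max_features, bootstrap=True, n_jobs=-1, verbose=0)
    
    # Scikit-learn's 'neg_log_loss' scorer is already negated, so maximizing it
    # minimizes the log loss; with a raw loss value, return -loss instead
    return cross_val_score(etrees, x, y, scoring='neg_log_loss',
                           cv=cv_folds, n_jobs=-1).mean()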

BO requires that the user specify real-valued, inclusive bounds for each parameter. As there is no built-in support for integer or categorical values, the user must force a conversion, as was done for n_estimators (the number of decision trees). Note also that we used 0.99 as the maximum value for max_samples because the ExtraTreesClassifier documentation requires this value to be strictly less than 1. Otherwise, we only have to be careful to use the exact parameter names that appear in the Scikit-learn function call.
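
Here is a minimal sketch (our illustration, not from the article) of how a categorical parameter such as criterion could be exposed to BO, by letting BO propose a float that is then rounded to an index into a list of choices:

criterion_choices = ['gini', 'entropy']   # hypothetical categorical choices

def etrees_crossval_with_criterion(n_estimators, criterion_index):
    etrees = ExtraTreesClassifier(
             n_estimators=int(n_estimators),                            # float -> int
             criterion=criterion_choices[int(round(criterion_index))],  # float -> category
             bootstrap=True, n_jobs=-1, verbose=0)
    
    return cross_val_score(etrees, x, y, scoring=error_metric,
                           cv=cv_folds, n_jobs=-1).mean()

# pbounds would then include 'criterion_index': (0, len(criterion_choices) - 1)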

init_points is the number of sets of hyperparameters generated by random search at the start of the run, and n_iter is the number subsequently generated via Bayesian inference and Gaussian processes; the total number of sets evaluated is their sum. BO attempts to balance exploration (random search) and exploitation (guided search). These are controlled via parameters that can be set by the user; we only used the default values. See Section 3 in the Advanced tour of the Bayesian Optimization package for how to access these parameters.
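
For reference, here is a minimal sketch of working with the acquisition function directly, assuming a 1.x release of the package in which UtilityFunction, suggest, and register are available (kappa and xi control the exploration/exploitation trade-off):

from bayes_opt import UtilityFunction

# larger kappa favors exploration, smaller kappa favors exploitation
utility = UtilityFunction(kind='ucb', kappa=2.576, xi=0.0)

next_params = optimizer.suggest(utility)                 # propose a set of hyperparameters
target = etrees_crossval(**next_params)                  # evaluate it
optimizer.register(params=next_params, target=target)    # feed the result back to BO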

Setting verbose=2 will print each set of parameters to the screen during run time, showing the best result in purple.

To collect all results from BO, we have:

list_dfs = []
counter = 0
for result in optimizer.res:
    df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                     columns=['trial' + str(counter)]).T
    df_temp[error_metric] = result['target']
    
    list_dfs.append(df_temp)
    
    counter = counter + 1
    
df_results = pd.concat(list_dfs, axis=0)
df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')

optimizer.res is a list of dictionaries. The code above puts each dictionary into a DataFrame and then stacks them vertically into a single DataFrame. optimizer.max yields the best result (the maximum target); it can also be recovered from the combined DataFrame.
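
For example, the best trial can be read off the combined DataFrame as follows (a small sketch using the names defined above):

best_trial = df_results[error_metric].idxmax()   # index label of the best trial
print(df_results.loc[best_trial])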

3. Ensembling

In the objective function, we used cross validation to score each set of hyperparameters proposed by BO. To obtain an ensemble, we apply a threshold to the cross validation score (we used balanced accuracy) and, for each accepted set of hyperparameters, refit an ExtraTreesClassifier on the entire calibration data set. We then feed the production data (unseen data) into each fitted model to obtain class probabilities, NOT the predicted classes themselves. These probabilities are averaged across models, and the final predicted class is the one with the largest mean probability.

accepted_models_num = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    if df_params.loc[df_params.index[i],type_error] > threshold:
        ml_model = ExtraTreesClassifier(
        n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
        max_features=df_params.loc[df_params.index[i],'max_features'], 
        min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
        min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']), 
        max_samples=df_params.loc[df_params.index[i],'max_samples'],
        bootstrap=True, n_jobs=-1, verbose=0)
        
        ml_model.fit(xcalib, ycalib)
        
        list_predicted_prob.append(ml_model.predict_proba(xprod))
        
        accepted_models_num = accepted_models_num + 1
        
        if save_models_flag:
            model_name = ml_name + df_params.index[i] + '_joblib.sav'
            dump(ml_model, save_directory + model_name)

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses the 1st occurrence in case of a tie
y_predicted_class = np.argmax(mean_probabilities, axis=1)

4. Results

We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data; the remaining test data served as the production (unseen) data.
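
The article does not show how the .npy files were created. The sketch below (an assumption on our part, not the author's preprocessing code) illustrates how such a calibration/production split could be built, using file names that match those loaded in Section 6:

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# download MNIST; the first 60,000 samples are the usual training set
x_all, y_all = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y_all = y_all.astype(np.int64)

x_train, x_prod = x_all[:60000], x_all[60000:]
y_train, y_prod = y_all[:60000], y_all[60000:]

# keep half of the training data as calibration data, stratified by class
x_calib, _, y_calib, _ = train_test_split(x_train, y_train, train_size=0.5,
                                          stratify=y_train, random_state=0)

data_directory = 'YOUR DIRECTORY'   # same placeholder as in Section 6
np.save(data_directory + 'x_mnist_calibration_1.npy', x_calib)
np.save(data_directory + 'y_mnist_calibration_1.npy', y_calib)
np.save(data_directory + 'x_mnist_production.npy', x_prod)
np.save(data_directory + 'y_mnist_production.npy', y_prod)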

ExtraTreesClassifier Hyperparameters (Sorted by Balanced Accuracy)

max_features  max_samples  min_samples_leaf  min_samples_split  n_estimators  balanced_accuracy
0.1961        0.9269       1.1227            2.1161             207.9663      0.9523
0.2117        0.9777       1.1966            2.2298             235.0982      0.9520
0.1688        0.9753       1.1505            2.5158             196.8729      0.9516
0.1544        0.9599       1.0066            2.5087             198.8168      0.9515
0.1800        0.9519       1.0254            2.9930             214.4838      0.9515
0.1826        0.9756       1.1458            2.1729             232.7461      0.9515
0.2007        0.9245       1.1570            2.0052             208.9746      0.9515
0.1518        0.9882       1.3423            2.4815             198.9588      0.9513
0.1640        0.9809       1.2898            2.3159             229.9535      0.9512
0.2033        0.9547       1.1997            2.6057             228.9384      0.9512
0.1611        0.9851       1.5794            2.0533             192.6078      0.9512
0.1536        0.9663       1.4732            2.7149             230.4336      0.9510
0.1582        0.9735       1.4221            2.5452             137.7178      0.9509
0.1518        0.9129       1.0644            2.0604             176.7422      0.9507
0.2228        0.9887       1.3270            2.9310             225.7456      0.9505
0.1636        0.9834       1.0694            4.9818             230.1185      0.9504
0.1904        0.9614       1.1356            2.9671             182.6704      0.9504
0.2104        0.9878       1.0428            3.1741             217.8862      0.9504
0.3227        0.9134       1.0605            2.0237             228.7140      0.9504
0.1537        0.9877       1.0305            2.1220             158.5868      0.9504
0.1600        0.9370       1.0030            2.4038             236.6105      0.9504
0.1688        0.9540       1.0092            2.4244             121.8958      0.9504
0.1556        0.9740       1.6935            2.0528             138.1634      0.9503
0.2078        0.9821       1.0207            4.3096             219.1431      0.9502
0.2238        0.9666       1.0800            2.3109             217.9462      0.9502
0.1727        0.9306       1.1561            2.0065             236.3114      0.9501
0.2026        0.9391       1.1060            2.1230             189.2522      0.9501
0.2581        0.9571       1.0083            2.6091             220.0087      0.9501
0.2206        0.9792       1.0154            3.0401             158.7534      0.9499
0.1726        0.9311       1.0054            2.8431             170.7284      0.9498
0.2978        0.9786       1.1017            2.1668             160.0341      0.9495
0.2459        0.9592       1.0485            2.1825             207.4609      0.9492
0.2093        0.8186       1.2665            2.0807             154.5424      0.9486
0.2602        0.7388       1.0667            2.2214             150.7549      0.9472
0.5697        0.9470       1.2477            2.0063             137.2946      0.9470
0.3175        0.7180       1.7369            8.4912             232.3052      0.9433
0.7902        0.6108       2.6842            2.4465             193.7491      0.9376
0.5998        0.6992       1.0370            13.8980            250.3599      0.9371
0.9031        0.8760       1.0370            13.3955            217.8444      0.9365
0.6608        0.9348       4.4894            10.5830            236.3424      0.9365
0.8072        0.9697       1.1488            13.6221            78.7524       0.9354
0.5268        0.7206       4.0216            9.8794             126.5908      0.9338
0.5815        0.6418       4.8844            3.5956             98.5232       0.9302
0.2003        0.6104       1.9644            13.8727            25.0170       0.9294
0.7331        0.7944       1.0158            2.5685             25.6396       0.9263
0.1542        0.6322       8.1930            9.2273             218.2138      0.9254
0.4970        0.8521       10.9806           9.6078             237.1578      0.9207
0.7974        0.6009       8.9383            2.3354             246.9411      0.9178
0.2673        0.7177       12.7993           9.2520             174.0114      0.9159
0.3489        0.8850       11.4094           9.6054             44.2291       0.9142

balanced accuracy score = 0.9596
accuracy score = 0.9599
number of accepted models = 43 for threshold = 0.93


5. Remarks

Due to its ease of use, Bayesian Optimization can be considered a drop-in replacement for Scikit-learn's random hyperparameter search. It should produce better hyperparameters, and do so faster, than pure random search, while at worst it is equivalent to random search.

We also note that while it would be nice to have built-in support for integer and categorical parameters, the creators of BO have opted for a lightweight approach that makes the package easy to use and maintain. This latter aspect is important, as other optimization packages have fallen by the wayside when abandoned by their creators. BO depends only on NumPy, SciPy, and Scikit-learn. As these packages are usually upgraded in a way that does not break older code, BO is less reliant on active maintenance than other packages with more bells and whistles.

6. Code

import os
import numpy as np
import pandas as pd
from bayes_opt import BayesianOptimization
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump
from sklearn.metrics import balanced_accuracy_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
import time
from pathlib import Path
import sys
sys.path.append('/home/ubuntu/python/miscellaneous_library')
import confusion_matrix_plot
    
# *******************************************************************
    
def create_save_models_bayes_opt(x, y, error_metric, cv_folds,
    num_random_points, num_iterations, results_dir):
    
    start_time_total = time.time()
    
    def etrees_crossval(n_estimators, max_features, min_samples_split, 
                        min_samples_leaf, max_samples):
        etrees = ExtraTreesClassifier(n_estimators=int(n_estimators),
                 max_features=max_features, min_samples_split=int(min_samples_split),
                 min_samples_leaf=int(min_samples_leaf), max_samples=max_samples,
                 bootstrap=True, n_jobs=-1, verbose=0)
        
        mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric, 
                                        cv=cv_folds, n_jobs=-1).mean()
        
        return mean_cv_score
       
    optimizer = BayesianOptimization(f=etrees_crossval,
                                     pbounds={'n_estimators': (25, 251),
                                              'max_features': (0.15, 1.0),
                                              'min_samples_split': (2, 14),
                                              'min_samples_leaf': (1, 14),
                                              'max_samples': (0.6, 0.99)},
                                     verbose=2)
    
    optimizer.maximize(init_points=num_random_points, 
                       n_iter=num_iterations)
    
    print('\nbest result:', optimizer.max)
    
    elapsed_time_total = (time.time()-start_time_total)/60
    print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')
    
    # optimizer.res is a list of dicts
    list_dfs = []
    counter = 0
    for result in optimizer.res:
        df_temp = pd.DataFrame.from_dict(data=result['params'], orient='index',
                                         columns=['trial' + str(counter)]).T
        df_temp[error_metric] = result['target']
        
        list_dfs.append(df_temp)
        
        counter = counter + 1
        
    df_results = pd.concat(list_dfs, axis=0)
    df_results.to_pickle(results_dir + 'df_bayes_opt_results_parameters.pkl')
    df_results.to_csv(results_dir + 'df_bayes_opt_results_parameters.csv')
        
# end of create_save_models_bayes_opt()
    
# *******************************************************************
            
def make_final_predictions(xcalib, ycalib, xprod, yprod, 
                           list_class_names, models_directory, 
                           save_directory, save_models_flag, df_params,
                           threshold, type_error, ml_name):
    
    # apply threshold
    accepted_models_num = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        if df_params.loc[df_params.index[i],type_error] > threshold:
            ml_model = ExtraTreesClassifier(
            n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
            max_features=df_params.loc[df_params.index[i],'max_features'], 
            min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
            min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']), 
            max_samples=df_params.loc[df_params.index[i],'max_samples'],
            bootstrap=True, n_jobs=-1, verbose=0)
            
            ml_model.fit(xcalib, ycalib)
            
            list_predicted_prob.append(ml_model.predict_proba(xprod))
            
            accepted_models_num = accepted_models_num + 1
            
            if save_models_flag:
                model_name = ml_name + df_params.index[i] + '_joblib.sav'
                dump(ml_model, save_directory + model_name)

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)
    
    # compute predicted class
    # argmax uses the 1st occurrence in case of a tie
    y_predicted_class = np.argmax(mean_probabilities, axis=1)
    
    # compute and save error measures

    # print info to file
    stdout_default = sys.stdout
    sys.stdout = open(save_directory + ml_name + '_prediction_results.txt','w')
    
    print('balanced accuracy score =',balanced_accuracy_score(yprod, y_predicted_class))
    
    print('accuracy score =',accuracy_score(yprod, y_predicted_class))
    
    print('number of accepted models =',accepted_models_num,' for threshold =',threshold)
    
    print('\nclassification report:')
    print(classification_report(yprod, y_predicted_class, digits=3, output_dict=False))
    
    print('\nraw confusion matrix:')
    cm_raw = confusion_matrix(yprod, y_predicted_class)
    print(cm_raw)
    
    print('\nconfusion matrix normalized by prediction:')
    cm_pred = confusion_matrix(yprod, y_predicted_class, normalize='pred')
    print(cm_pred)
    
    print('\nconfusion matrix normalized by true:')
    cm_true = confusion_matrix(yprod, y_predicted_class, normalize='true')
    print(cm_true)
    
    sys.stdout = stdout_default   
 
    # plot and save confusion matrices
    figure_size = (12, 8)
    number_of_decimals = 4
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_raw, 
    list_class_names, save_directory, 'Confusion Matrix', 
    ml_name + '_confusion_matrix', False, None, 30, figure_size,
    number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_pred, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Prediction', 
    ml_name + '_confusion_matrix_norm_by_prediction', False, 'pred', 
    30, figure_size, number_of_decimals)
    
    confusion_matrix_plot.confusion_matrix_save_and_plot(cm_true, 
    list_class_names, save_directory, 'Confusion Matrix Normalized by Actual', 
    ml_name + '_confusion_matrix_norm_by_true', False, 'true', 
    30, figure_size, number_of_decimals)

# end of make_final_predictions()
       
# *******************************************************************

if __name__ == '__main__':    
    
    ml_algorithm_name = 'etrees'
    file_name_stub = ml_algorithm_name + '_bayes_opt'  
    
    calculation_type = 'production' #'calibration' 'production'
    
    data_directory = 'YOUR DIRECTORY'   # directory containing the MNIST .npy files
    
    base_directory = 'YOUR DIRECTORY'   # directory where results will be written
    
    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)
                
    # fixed parameters
    error_type = 'balanced_accuracy'
    threshold_error = 0.93
    cross_valid_folds = 3
    total_number_of_iterations = 50
    number_of_random_points = 10  # random searches to start opt process
    # this is # of bayes iters, thus total=this + # of random pts
    number_of_iterations =  total_number_of_iterations - number_of_random_points
    save_models = False
           
    # use small data set
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')        
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')
                
    print('\n*** starting at',pd.Timestamp.now())

    # 1 - calibration - using cross validation, get cv score for the given set of
    # parameters via bayes opt search
    if calculation_type == 'calibration':
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
        
        create_save_models_bayes_opt(x_calib, y_calib, error_type, 
                                     cross_valid_folds, 
                                     number_of_random_points, number_of_iterations,
                                     results_directory)

    # 2 - production - apply threshold
    elif calculation_type == 'production':
        
        # get etrees parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_bayes_opt_results_parameters.pkl')
        
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)
            
        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')
        
        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))
                
        make_final_predictions(x_calib, y_calib, x_prod, y_prod, 
                               class_names_list, 
                               models_dir, results_directory, 
                               save_models, df_parameters, 
                               threshold_error, error_type, ml_algorithm_name)
               
    else:
        print('\ninvalid calculation type:',calculation_type)
        raise NameError