1. GPyOpt Python Package
GPyOpt is a Python open-source library for Bayesian Optimization developed by the Machine Learning group of the University of Sheffield. It is based on GPy, a Python framework for Gaussian process modelling.
In this article, we demonstrate how to use this package to do hyperparameter search for a classification problem with Scikit-learn.
2. Using GPyOpt
Below are code fragments showing how to integrate the package with Scikit-learn.
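The fragments assume that the relevant imports are already in place (a subset of the full listing in the Code section):

```python
import os
import numpy as np
import pandas as pd
import GPyOpt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
```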
We begin by specifying whether the problem is one of minimization or maximization. Here, we are again using the ExtraTreesClassifier on the MNIST handwritten digits data set, so we wish to maximize the balanced accuracy.
```python
if error_metric in ['balanced_accuracy', 'accuracy']:
    maximize_bool = True
else:
    raise NameError
```
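As a reminder, balanced accuracy is the unweighted mean of per-class recall, so it is robust to class imbalance. A minimal illustration with Scikit-learn (the labels here are made-up toy values, not data from this article):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 1]  # toy example: class 0 dominates
y_pred = [0, 0, 0, 0]  # always predicting the majority class
# plain accuracy would be 0.75, but balanced accuracy is (1.0 + 0.0)/2
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```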
Then we set the search space as a list of dictionaries, one per hyperparameter, in which we label the variable names, set variable types to continuous or discrete, and specify their ranges of values.
```python
search_space = [
    {'name':'max_features', 'type':'continuous', 'domain':(0.15, 1.0)},
    {'name':'max_samples', 'type':'continuous', 'domain':(0.6, 0.99)},
    {'name':'min_samples_split', 'type':'continuous', 'domain':(2, 14.49)},
    {'name':'min_samples_leaf', 'type':'continuous', 'domain':(1, 14.49)},
    {'name':'n_estimators', 'type':'discrete',
     'domain':(25, 50, 75, 100, 125, 150, 175, 200, 225, 250)},
    {'name':'criterion', 'type':'discrete', 'domain':(1, 2)}]
```
Now we wrap our Scikit-learn model in an objective function that returns the mean cross-validation score.
```python
def etrees_crossval(params):
    dict_criterion = {1:'gini', 2:'entropy'}
    etrees = ExtraTreesClassifier(
        max_features=params[0][0],
        max_samples=params[0][1],
        min_samples_split=int(params[0][2]),
        min_samples_leaf=int(params[0][3]),
        n_estimators=int(params[0][4]),
        criterion=dict_criterion[int(params[0][5])],
        bootstrap=True, n_jobs=-1, verbose=0)

    mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric,
                                    cv=cv_folds, n_jobs=-1).mean()

    return mean_cv_score
```
params holds the hyperparameter values that GPyOpt proposes for a single trial, passed as a two-dimensional array with one row. Note that the second index (params[0][0], params[0][1], and so on) must follow the order of the variables specified in search_space.
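For example, GPyOpt might call the objective with an array like the following (values are made up purely for illustration):

```python
import numpy as np

# one row, one column per variable, in search_space order:
# max_features, max_samples, min_samples_split, min_samples_leaf, n_estimators, criterion
params = np.array([[0.42, 0.85, 6.3, 2.7, 150.0, 1.0]])
# etrees_crossval(params) would then train with max_features=0.42,
# min_samples_split=6, n_estimators=150, criterion='gini', and so on
```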
Next, we create our Bayesian optimization model:
```python
gpyopt_bo = GPyOpt.methods.BayesianOptimization(
    f=etrees_crossval, domain=search_space,
    model_type='GP', initial_design_numdata=num_random_points,
    initial_design_type='random', acquisition_type='EI',
    normalize_Y=True, exact_feval=False,
    acquisition_optimizer_type='lbfgs', model_update_interval=1,
    evaluator_type='sequential', batch_size=1,
    num_cores=os.cpu_count(), verbosity=True,
    verbosity_model=False, maximize=maximize_bool,
    de_duplication=True)
```
We are using Gaussian processes, with the expected improvement (EI) acquisition function and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs) algorithm to optimize the acquisition function. See https://gpyopt.readthedocs.io/en/latest/GPyOpt.methods.html for information about all of the parameters as well as how to use models other than Gaussian processes.
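For instance, switching to a random-forest surrogate with a lower-confidence-bound acquisition function only requires changing two arguments (a sketch under the same objective and search space as above; see the documentation link for all supported values):

```python
# same objective and domain, different surrogate model and acquisition function
gpyopt_bo_rf = GPyOpt.methods.BayesianOptimization(
    f=etrees_crossval, domain=search_space,
    model_type='RF',           # random forest surrogate instead of a Gaussian process
    acquisition_type='LCB',    # lower confidence bound instead of expected improvement
    initial_design_numdata=num_random_points,
    maximize=maximize_bool)
```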
In the next step, we set up log files and run the optimization by specifying the number of trials.
```python
# text file with parameter info, gp info, best results
rf = results_dir + 'report'
# text file with header: Iteration Y var_1 var_2 etc.
ef = results_dir + 'evaluation'
# text file with gp model info
mf = results_dir + 'models'

gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf,
                           evaluations_file=ef, models_file=mf)
```
Finally, we create a Pandas DataFrame to store the hyperparameter values and cross-validated balanced accuracy for each trial. Note the negative sign in front of the evaluation results returned by GPyOpt (since we are maximizing, gpyopt_bo.Y holds the negated objective values) and that integer-valued hyperparameters are rendered as reals.
```python
header_params = []
for param in search_space:
    header_params.append(param['name'])

df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

# gpyopt yields -0.91, integer values are rendered as real (1 -> 1.0)
df_results[error_metric] = -gpyopt_bo.Y
```
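The single best point can also be read directly from the optimizer via its x_opt and fx_opt attributes (a short sketch; the sign is flipped for the same reason as above):

```python
print('best hyperparameters:', gpyopt_bo.x_opt)      # values in search_space order
print('best balanced accuracy:', -gpyopt_bo.fx_opt)  # negate because we are maximizing
```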
3. Best Result and Ensembling
To obtain an ensemble, we apply a threshold on the cross-validation score (balanced accuracy) and then refit the ExtraTreesClassifier on the entire calibration data set. Then we input the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are averaged and the final predicted class is determined via plurality voting. We also limit the number of models used to build the ensemble.
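To make the averaging step concrete, here is a toy sketch with two models and three classes (numbers are invented; it is the same mean-then-argmax logic used in the code below):

```python
import numpy as np

# per-model class probabilities for 2 production samples and 3 classes
probs_model_1 = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3]])
probs_model_2 = np.array([[0.4, 0.4, 0.2],
                          [0.1, 0.3, 0.6]])

mean_probabilities = np.mean([probs_model_1, probs_model_2], axis=0)
# argmax over classes gives the predicted class per sample (first occurrence wins ties)
print(np.argmax(mean_probabilities, axis=1))  # [0 2]
```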
To obtain the single best result, we sort the DataFrame mentioned in the previous section and take the hyperparameters that yielded the highest cross-validated score.
```python
if type_error in ['balanced_accuracy', 'accuracy']:
    threshold_multiplier = 1.0
    df_params.sort_values(by=type_error, ascending=False, inplace=True)
else:
    raise NameError

# apply threshold
dict_criterion = {1:'gini', 2:'entropy'}
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
    bool1 = df_params.loc[df_params.index[i], type_error] > threshold_multiplier*threshold
    bool2 = accepted_models < accepted_models_num
    if bool1 and bool2:
        criterion_int = int(df_params.loc[df_params.index[i], 'criterion'])

        ml_model = ExtraTreesClassifier(
            n_estimators=int(df_params.loc[df_params.index[i], 'n_estimators']),
            max_features=df_params.loc[df_params.index[i], 'max_features'],
            min_samples_split=int(df_params.loc[df_params.index[i], 'min_samples_split']),
            min_samples_leaf=int(df_params.loc[df_params.index[i], 'min_samples_leaf']),
            max_samples=df_params.loc[df_params.index[i], 'max_samples'],
            criterion=dict_criterion[criterion_int],
            bootstrap=True, n_jobs=-1, verbose=0)

        ml_model.fit(xcalib, ycalib)

        list_predicted_prob.append(ml_model.predict_proba(xprod))

        # best result
        if i == 0:
            y_predicted_class_best = ml_model.predict(xprod)
            name_result = ml_name + '_best'
            classification_analysis.classification_analysis(
                yprod, y_predicted_class_best, list_class_names,
                save_directory, name_result)

        accepted_models = accepted_models + 1

        if save_models_flag:
            model_name = ml_name + str(df_params.index[i]) + '_joblib.sav'
            dump(ml_model, save_directory + model_name)

print('\nthere were', accepted_models, ' accepted_models\n')

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st occurrence in case of a tie
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
```
4. Results
We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data. The best trials, sorted by cross-validated balanced accuracy, are shown below (criterion 1 = gini, 2 = entropy).
| max_features | max_samples | min_samples_split | min_samples_leaf | n_estimators | criterion | balanced_accuracy |
|---|---|---|---|---|---|---|
| 0.150 | 0.990 | 2.000 | 1.000 | 225 | 1 | 0.9520 |
| 0.150 | 0.990 | 2.000 | 1.000 | 250 | 1 | 0.9515 |
| 0.150 | 0.990 | 2.000 | 1.000 | 175 | 1 | 0.9515 |
| 0.150 | 0.990 | 2.000 | 1.000 | 200 | 1 | 0.9514 |
| 0.150 | 0.990 | 2.005 | 1.000 | 225 | 1 | 0.9507 |
| 0.169 | 0.989 | 2.709 | 1.174 | 175 | 1 | 0.9506 |
| 0.192 | 0.944 | 3.878 | 1.985 | 250 | 1 | 0.9505 |
| 0.167 | 0.945 | 4.337 | 1.172 | 225 | 1 | 0.9505 |
| 0.150 | 0.990 | 2.000 | 1.000 | 225 | 2 | 0.9503 |
| 0.350 | 0.970 | 3.918 | 1.096 | 200 | 1 | 0.9499 |
Best results:
- balanced accuracy score = 0.9607
- accuracy score = 0.961

Ensemble results:
- balanced accuracy score = 0.961
- accuracy score = 0.9613
5. Code
```python
import os
import time
from pathlib import Path

import numpy as np
import pandas as pd
import GPyOpt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump

import classification_analysis  # author's helper module for classification metrics (not shown here)

# *******************************************************************

def create_save_models(x, y, error_metric, cv_folds, num_random_points,
                       num_iterations, results_dir):

    if error_metric in ['balanced_accuracy', 'accuracy']:
        maximize_bool = True
    else:
        raise NameError

    start_time_total = time.time()

    search_space = [
        {'name':'max_features', 'type':'continuous', 'domain':(0.15, 1.0)},
        {'name':'max_samples', 'type':'continuous', 'domain':(0.6, 0.99)},
        {'name':'min_samples_split', 'type':'continuous', 'domain':(2, 14.49)},
        {'name':'min_samples_leaf', 'type':'continuous', 'domain':(1, 14.49)},
        {'name':'n_estimators', 'type':'discrete',
         'domain':(25, 50, 75, 100, 125, 150, 175, 200, 225, 250)},
        {'name':'criterion', 'type':'discrete', 'domain':(1, 2)}]

    # create sklearn model
    def etrees_crossval(params):
        dict_criterion = {1:'gini', 2:'entropy'}
        etrees = ExtraTreesClassifier(
            max_features=params[0][0],
            max_samples=params[0][1],
            min_samples_split=int(params[0][2]),
            min_samples_leaf=int(params[0][3]),
            n_estimators=int(params[0][4]),
            criterion=dict_criterion[int(params[0][5])],
            bootstrap=True, n_jobs=-1, verbose=0)

        mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric,
                                        cv=cv_folds, n_jobs=-1).mean()

        return mean_cv_score

    gpyopt_bo = GPyOpt.methods.BayesianOptimization(
        f=etrees_crossval, domain=search_space,
        model_type='GP', initial_design_numdata=num_random_points,
        initial_design_type='random', acquisition_type='EI',
        normalize_Y=True, exact_feval=False,
        acquisition_optimizer_type='lbfgs', model_update_interval=1,
        evaluator_type='sequential', batch_size=1,
        num_cores=os.cpu_count(), verbosity=True,
        verbosity_model=False, maximize=maximize_bool,
        de_duplication=True)

    # text file with parameter info, gp info, best results
    rf = results_dir + 'report'
    # text file with header: Iteration Y var_1 var_2 etc.
    ef = results_dir + 'evaluation'
    # text file with gp model info
    mf = results_dir + 'models'

    gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf,
                               evaluations_file=ef, models_file=mf)

    # put results in dfs
    header_params = []
    for param in search_space:
        header_params.append(param['name'])

    df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

    # gpyopt yields -0.91, integer values are rendered as real (1 -> 1.0)
    df_results[error_metric] = -gpyopt_bo.Y

    df_results.to_pickle(results_dir + 'df_results.pkl')
    df_results.to_csv(results_dir + 'df_results.csv')

    elapsed_time_total = (time.time() - start_time_total)/60
    print('\n\ntotal elapsed time =', elapsed_time_total, ' minutes')

# end of create_save_models()

# *******************************************************************

def make_final_predictions(xcalib, ycalib, xprod, yprod,
                           save_directory, save_models_flag, df_params,
                           threshold, type_error, accepted_models_num,
                           ml_name, list_class_names):

    if type_error in ['balanced_accuracy', 'accuracy']:
        threshold_multiplier = 1.0
        df_params.sort_values(by=type_error, ascending=False, inplace=True)
    else:
        raise NameError

    # apply threshold
    dict_criterion = {1:'gini', 2:'entropy'}
    accepted_models = 0
    list_predicted_prob = []
    num_models = df_params.shape[0]
    for i in range(num_models):
        bool1 = df_params.loc[df_params.index[i], type_error] > threshold_multiplier*threshold
        bool2 = accepted_models < accepted_models_num
        if bool1 and bool2:
            criterion_int = int(df_params.loc[df_params.index[i], 'criterion'])

            ml_model = ExtraTreesClassifier(
                n_estimators=int(df_params.loc[df_params.index[i], 'n_estimators']),
                max_features=df_params.loc[df_params.index[i], 'max_features'],
                min_samples_split=int(df_params.loc[df_params.index[i], 'min_samples_split']),
                min_samples_leaf=int(df_params.loc[df_params.index[i], 'min_samples_leaf']),
                max_samples=df_params.loc[df_params.index[i], 'max_samples'],
                criterion=dict_criterion[criterion_int],
                bootstrap=True, n_jobs=-1, verbose=0)

            ml_model.fit(xcalib, ycalib)

            list_predicted_prob.append(ml_model.predict_proba(xprod))

            # best results
            if i == 0:
                y_predicted_class_best = ml_model.predict(xprod)
                name_result = ml_name + '_best'
                classification_analysis.classification_analysis(
                    yprod, y_predicted_class_best, list_class_names,
                    save_directory, name_result)

            accepted_models = accepted_models + 1

            if save_models_flag:
                model_name = ml_name + str(df_params.index[i]) + '_joblib.sav'
                dump(ml_model, save_directory + model_name)

    print('\nthere were', accepted_models, ' accepted_models\n')

    # compute mean probabilities
    mean_probabilities = np.mean(list_predicted_prob, axis=0)

    # compute predicted class
    # argmax uses 1st occurrence in case of a tie
    y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
    # y_predicted_class_ensemble can be passed to the same classification_analysis
    # helper to produce the ensemble metrics reported above

# end of make_final_predictions()

# *******************************************************************

if __name__ == '__main__':

    ml_algorithm_name = 'etrees'
    file_name_stub = 'gpyopt_' + ml_algorithm_name

    calculation_type = 'production'

    base_directory = YOUR DIRECTORY  # placeholder: set to your own path
    data_directory = YOUR DIRECTORY  # placeholder: set to your own path

    results_directory_stub = base_directory + file_name_stub + '/'
    if not Path(results_directory_stub).is_dir():
        os.mkdir(results_directory_stub)

    # fixed parameters
    error_type = 'balanced_accuracy'
    threshold_error = 0.93
    cross_valid_folds = 3
    total_number_of_iterations = 100
    number_of_random_points = 20  # random searches to start opt process
    number_of_iterations = total_number_of_iterations - number_of_random_points
    save_models = False
    num_models_to_accept = 10

    # use small data set
    x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')
    y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')

    # 1 - calibration
    if calculation_type == 'calibration':
        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        create_save_models(x_calib, y_calib, error_type, cross_valid_folds,
                           number_of_random_points, number_of_iterations,
                           results_directory)

    # 2 - production - apply threshold
    elif calculation_type == 'production':
        # get etrees parameters
        models_dir = results_directory_stub + 'calibration/'
        df_parameters = pd.read_pickle(models_dir + 'df_results.pkl')

        results_directory = results_directory_stub + calculation_type + '/'
        if not Path(results_directory).is_dir():
            os.mkdir(results_directory)

        x_prod = np.load(data_directory + 'x_mnist_production.npy')
        y_prod = np.load(data_directory + 'y_mnist_production.npy')

        num_classes = np.unique(y_prod).shape[0]
        class_names_list = []
        for i in range(num_classes):
            class_names_list.append('class ' + str(i))

        make_final_predictions(x_calib, y_calib, x_prod, y_prod,
                               results_directory, save_models, df_parameters,
                               threshold_error, error_type,
                               num_models_to_accept, ml_algorithm_name,
                               class_names_list)

    else:
        print('\ninvalid calculation type:', calculation_type)
        raise NameError
```
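The .npy files referenced above are assumed to exist already; below is a minimal sketch of how such files could be produced (the file names match those used in the script, but the fetching method and split shown here are illustrative assumptions rather than the author's original preprocessing):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# download MNIST: 70,000 images of 28x28 pixels flattened to 784 features
x_all, y_all = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y_all = y_all.astype(int)

# conventional split: first 60,000 for training, last 10,000 held out
x_train, y_train = x_all[:60000], y_all[:60000]
x_prod, y_prod = x_all[60000:], y_all[60000:]

# use half of the training data as the calibration set, as described in the article
np.save('x_mnist_calibration_1.npy', x_train[:30000])
np.save('y_mnist_calibration_1.npy', y_train[:30000])
np.save('x_mnist_production.npy', x_prod)
np.save('y_mnist_production.npy', y_prod)
```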