Select Page

### 1. GPyOpt Python Package

GPyOpt is a Python open-source library for Bayesian Optimization developed by the Machine Learning group of the University of Sheffield. It is based on GPy, a Python framework for Gaussian process modelling.

In this article, we demonstrate how to use this package to do hyperparameter search for a classification problem with Scikit-learn.

### 2. Using GPyOpt

Below are code fragments showing how to integrate the package with Scikit-learn.

We begin by specifying if the problem is one of minimization or maximization. Here, we are again using the ExtraTreesClassifier for the MNIST handwritten digits data set so we wish to maximize the balanced accuracy.

```if error_metric in ['balanced_accuracy', 'accuracy']:
maximize_bool = True
else:
raise NameError
```

Then we set the search space as a single element list of dictionaries in which we label the variable names, set variable types to continuous or discrete, and specify their range of values.

```search_space = [
{'name':'max_features', 'type':'continuous', 'domain':(0.15, 1.0)},
{'name':'max_samples', 'type':'continuous', 'domain':(0.6, 0.99)},
{'name':'min_samples_split', 'type':'continuous', 'domain':(2, 14.49)},
{'name':'min_samples_leaf', 'type':'continuous', 'domain':(1, 14.49)},
{'name':'n_estimators', 'type':'discrete',
'domain':(25, 50, 75, 100, 125, 150, 175, 200, 225, 250)},
{'name':'criterion', 'type':'discrete', 'domain':(1, 2)}]
```

Now we create our Sci-kit learn model.

```def etrees_crossval(params):
dict_criterion={1:'gini', 2:'entropy'}
etrees = ExtraTreesClassifier(
max_features=params[0][0],
max_samples=params[0][1],
min_samples_split=int(params[0][2]),
min_samples_leaf=int(params[0][3]),
n_estimators=int(params[0][4]),
criterion=dict_criterion[int(params[0][5])],
bootstrap=True, n_jobs=-1, verbose=0)

mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric,
cv=cv_folds, n_jobs=-1).mean()

return mean_cv_score
```

params is the search space list of dictionaries. Note that the second element must match the order specified in search_space.

We create our Bayesian optimization model

```gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=etrees_crossval, domain=search_space,
model_type='GP', initial_design_numdata=num_random_points,
initial_design_type='random', acquisition_type='EI',
normalize_Y=True, exact_feval=False,
acquisition_optimizer_type='lbfgs',
model_update_interval=1, evaluator_type='sequential',
batch_size=1, num_cores=os.cpu_count(), verbosity=True,
verbosity_model=False, maximize=maximize_bool, de_duplication=True)
```

We are using Gaussian processes, with expected improvement acquisition function and the limited memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs) optimization algorithm. See https://gpyopt.readthedocs.io/en/latest/GPyOpt.methods.html for information about all of the parameters as well as how to use models other than Gaussian processes.

In the next step, we set up log files and run the optimization by specifying the number of trials.

```# text file with parameter info, gp info, best results
rf = results_dir + 'report'

# text file with header: Iteration Y var_1 var_2 etc.
ef = results_dir + 'evaluation'

# text file with gp model info
mf = results_dir + 'models'
gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf,
evaluations_file=ef, models_file=mf)
```

Finally, we create a Pandas DataFrame to store the hyperparameter values and cross validated balanced accuracy for each trial. Note the negative in front of the evaluation result returned by GPyOpt and that integer values are rendered as reals.

```header_params = []
for param in search_space:

df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

# gpyopt yields -0.91, integer values are rendered as real (1 -> 1.0)
df_results[error_metric] = -gpyopt_bo.Y
```

### 3. Best Result and Ensembling

To obtain an ensemble, we apply a threshold on the cross validation error (balanced accuracy) and then refit ExtraTreesClassifier on the entire calibration data set. Then we input the production data (unseen data) to obtain class probabilities, NOT the classes themselves. These probabilities are then averaged and the final predicted class is determined via plurality voting. We also limit the number of models used to build the ensemble.

To obtain the single best result, we sorted the DataFrame mentioned in the previous section to obtain the hyperparameters that yielded the best result.

```if type_error in ['balanced_accuracy', 'accuracy']:
threshold_multiplier = 1.0
df_params.sort_values(by=type_error, ascending=False, inplace=True)
else:
raise NameError

# apply threshold
dict_criterion={1:'gini', 2:'entropy'}
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
bool1 = df_params.loc[df_params.index[i],type_error] > threshold_multiplier*threshold
bool2 = accepted_models < accepted_models_num
if bool1 and bool2:

criterion_int = int(df_params.loc[df_params.index[i],'criterion'])

ml_model = ExtraTreesClassifier(
n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
max_features=df_params.loc[df_params.index[i],'max_features'],
min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']),
max_samples=df_params.loc[df_params.index[i],'max_samples'],
criterion= dict_criterion[criterion_int],
bootstrap=True, n_jobs=-1, verbose=0)

ml_model.fit(xcalib, ycalib)

list_predicted_prob.append(ml_model.predict_proba(xprod))

# best result
if i == 0:
y_predicted_class_best = ml_model.predict(xprod)

name_result = ml_name + '_best'
classification_analysis.classification_analysis(
yprod, y_predicted_class_best, list_class_names,
save_directory, name_result)

accepted_models = accepted_models + 1

if save_models_flag:
model_name = ml_name + str(df_params.index[i]) + '_joblib.sav'
dump(ml_model, save_directory + model_name)

print('\nthere were',accepted_models,' accepted_models\n')

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st ocurrance in case of a tie
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)
```

### 4. Results

We used the MNIST handwritten digits data set. To reduce run time, we used half of the training data as our calibration data.

max_featuresmax_samplesmin_samples_splitmin_samples_leafn_estimatorscriterionbalanced_accuracy
0.1500.9902.0001.00022510.9520
0.1500.9902.0001.00025010.9515
0.1500.9902.0001.00017510.9515
0.1500.9902.0001.00020010.9514
0.1500.9902.0051.00022510.9507
0.1690.9892.7091.17417510.9506
0.1920.9443.8781.98525010.9505
0.1670.9454.3371.17222510.9505
0.1500.9902.0001.00022520.9503
0.3500.9703.9181.09620010.9499

Best results:
balanced accuracy score = 0.9607
accuracy score = 0.961

Ensemble results:
balanced accuracy score = 0.961
accuracy score = 0.9613

### 5. Code

```import os
import numpy as np
import pandas as pd
import GPyOpt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from joblib import dump
import time
from pathlib import Path

# *******************************************************************

def create_save_models(x, y, error_metric, cv_folds,
num_random_points, num_iterations, results_dir):

if error_metric in ['balanced_accuracy', 'accuracy']:
maximize_bool = True
else:
raise NameError

start_time_total = time.time()

search_space = [
{'name':'max_features', 'type':'continuous', 'domain':(0.15, 1.0)},
{'name':'max_samples', 'type':'continuous', 'domain':(0.6, 0.99)},
{'name':'min_samples_split', 'type':'continuous', 'domain':(2, 14.49)},
{'name':'min_samples_leaf', 'type':'continuous', 'domain':(1, 14.49)},
{'name':'n_estimators', 'type':'discrete',
'domain':(25, 50, 75, 100, 125, 150, 175, 200, 225, 250)},
{'name':'criterion', 'type':'discrete', 'domain':(1, 2)}]

# create sklearn model
def etrees_crossval(params):
dict_criterion={1:'gini', 2:'entropy'}
etrees = ExtraTreesClassifier(
max_features=params[0][0],
max_samples=params[0][1],
min_samples_split=int(params[0][2]),
min_samples_leaf=int(params[0][3]),
n_estimators=int(params[0][4]),
criterion=dict_criterion[int(params[0][5])],
bootstrap=True, n_jobs=-1, verbose=0)

mean_cv_score = cross_val_score(etrees, x, y, scoring=error_metric,
cv=cv_folds, n_jobs=-1).mean()

return mean_cv_score

gpyopt_bo = GPyOpt.methods.BayesianOptimization(f=etrees_crossval, domain=search_space,
model_type='GP', initial_design_numdata=num_random_points,
initial_design_type='random', acquisition_type='EI',
normalize_Y=True, exact_feval=False,
acquisition_optimizer_type='lbfgs',
model_update_interval=1, evaluator_type='sequential',
batch_size=1, num_cores=os.cpu_count(), verbosity=True,
verbosity_model=False, maximize=maximize_bool, de_duplication=True)

# text file with parameter info, gp info, best results
rf = results_dir + 'report'

# text file with header: Iteration	Y var_1 var_2 etc.
ef = results_dir + 'evaluation'

# text file with gp model info
mf = results_dir + 'models'
gpyopt_bo.run_optimization(max_iter=num_iterations, report_file=rf,
evaluations_file=ef, models_file=mf)

# put results in dfs
for param in search_space:

df_results = pd.DataFrame(data=gpyopt_bo.X, columns=header_params)

# gpyopt yields -0.91, integer values are rendered as real (1 -> 1.0)
df_results[error_metric] = -gpyopt_bo.Y
df_results.to_pickle(results_dir + 'df_results.pkl')
df_results.to_csv(results_dir + 'df_results.csv')

elapsed_time_total = (time.time()-start_time_total)/60
print('\n\ntotal elapsed time =',elapsed_time_total,' minutes')

# end of create_save_models()

# *******************************************************************

def make_final_predictions(xcalib, ycalib, xprod, yprod,
save_directory, save_models_flag, df_params,
threshold, type_error, accepted_models_num,
ml_name, list_class_names):

if type_error in ['balanced_accuracy', 'accuracy']:
threshold_multiplier = 1.0
df_params.sort_values(by=type_error, ascending=False, inplace=True)
else:
raise NameError

# apply threshold
dict_criterion={1:'gini', 2:'entropy'}
accepted_models = 0
list_predicted_prob = []
num_models = df_params.shape[0]
for i in range(num_models):
bool1 = df_params.loc[df_params.index[i],type_error] > threshold_multiplier*threshold
bool2 = accepted_models < accepted_models_num
if bool1 and bool2:

criterion_int = int(df_params.loc[df_params.index[i],'criterion'])

ml_model = ExtraTreesClassifier(
n_estimators=int(df_params.loc[df_params.index[i],'n_estimators']),
max_features=df_params.loc[df_params.index[i],'max_features'],
min_samples_split=int(df_params.loc[df_params.index[i],'min_samples_split']),
min_samples_leaf=int(df_params.loc[df_params.index[i],'min_samples_leaf']),
max_samples=df_params.loc[df_params.index[i],'max_samples'],
criterion= dict_criterion[criterion_int],
bootstrap=True, n_jobs=-1, verbose=0)

ml_model.fit(xcalib, ycalib)

list_predicted_prob.append(ml_model.predict_proba(xprod))

# best results
if i == 0:
y_predicted_class_best = ml_model.predict(xprod)

name_result = ml_name + '_best'
classification_analysis.classification_analysis(
yprod, y_predicted_class_best, list_class_names,
save_directory, name_result)

accepted_models = accepted_models + 1

if save_models_flag:
model_name = ml_name + str(df_params.index[i]) + '_joblib.sav'
dump(ml_model, save_directory + model_name)

print('\nthere were',accepted_models,' accepted_models\n')

# compute mean probabilities
mean_probabilities = np.mean(list_predicted_prob, axis=0)

# compute predicted class
# argmax uses 1st ocurrance in case of a tie
y_predicted_class_ensemble = np.argmax(mean_probabilities, axis=1)

# end of make_final_predictions()

# *******************************************************************

if __name__ == '__main__':

ml_algorithm_name = 'etrees'
file_name_stub = 'gpyopt_' + ml_algorithm_name

calculation_type = 'production'

base_directory = YOUR DIRECTORY
data_directory = YOUR DIRECTORY

results_directory_stub = base_directory + file_name_stub + '/'
if not Path(results_directory_stub).is_dir():
os.mkdir(results_directory_stub)

# fixed parameters
error_type = 'balanced_accuracy'
threshold_error = 0.93
cross_valid_folds = 3
total_number_of_iterations = 100
number_of_random_points = 20 # random searches to start opt process
number_of_iterations =  total_number_of_iterations - number_of_random_points
save_models = False
num_models_to_accept = 10

# use small data set
x_calib = np.load(data_directory + 'x_mnist_calibration_1.npy')
y_calib = np.load(data_directory + 'y_mnist_calibration_1.npy')

# 1 - calibration
if calculation_type == 'calibration':

results_directory = results_directory_stub + calculation_type + '/'
if not Path(results_directory).is_dir():
os.mkdir(results_directory)

create_save_models(x_calib, y_calib, error_type,
cross_valid_folds,
number_of_random_points, number_of_iterations,
results_directory)
# 2 - production - apply threshold
elif calculation_type == 'production':
# get etrees parameters
models_dir = results_directory_stub + 'calibration/'
df_parameters = pd.read_pickle(models_dir + 'df_results.pkl')

results_directory = results_directory_stub + calculation_type + '/'
if not Path(results_directory).is_dir():
os.mkdir(results_directory)

x_prod = np.load(data_directory + 'x_mnist_production.npy')
y_prod = np.load(data_directory + 'y_mnist_production.npy')

num_classes = np.unique(y_prod).shape[0]
class_names_list = []
for i in range(num_classes):
class_names_list.append('class ' + str(i))

make_final_predictions(x_calib, y_calib, x_prod, y_prod,
results_directory, save_models, df_parameters,
threshold_error, error_type, num_models_to_accept,
ml_algorithm_name, class_names_list)
else:
print('\ninvalid calculation type:',calculation_type)
raise NameError
```