Bayesian Optimization for Hyperparameter Tuning – Clearly explained.

Join thousands of students who advanced their careers with MachineLearningPlus. Go from Beginner to Data Science Expert through a structured road map of 70+ courses in 9 core specializations. Build industry grade Data Science projects.

Bayesian Optimization is a method used for optimizing ‘expensive-to-evaluate’ functions, particularly useful in hyperparameter tuning for machine learning models.

Let’s understand how it works and the math behind it with all the detail.

Overview of Bayesian Optimization

Bayesian optimization for hyperparameter tuning involves the following steps. We will break it down to simple details after this.

Surrogate Model: Bayesian optimization builds a probabilistic model of the objective function, where the objective function is typically the error (loss function) or accuracy of a machine learning model calculated on validation dataset. This surrogate models is typically built using Gaussian processes or other regression models. Usually it predicts both the mean and the variance of the objective function (score function).
Acquisition Function: It uses this model to guide the search for the next set of hyperparameters to evaluate by optimizing an acquisition function.
Evaluate and Update: The chosen hyperparameters from the previous step are evaluated using the true objective function (e.g., model performance on validation data). The surrogate model is then updated with this new information.
Iterate: Steps 2 and 3 are repeated until a stopping criterion is met (e.g., a fixed number of iterations or convergence).

Math Details

1. Surrogate Model

Gaussian Process (GP)

A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It’s specified by a mean function $( m(x) )$ and a covariance function (kernel) $( k(x, x’) )$:

$[ f(x) \sim \mathcal{GP}(m(x), k(x, x’)) ]$

Mean function: $( m(x) )$: Usually assumed to be zero for simplicity.
Kernel function: $( k(x, x’) )$: Measures the similarity between points $( x )$ and $( x’ )$.

Given a set of observations $( {(x_i, y_i)}_{i=1}^n )$, where $( y_i )$ is the function value at $( x_i )$, we assume:

$[ y_i = f(x_i) + \epsilon ]$

where $( \epsilon )$ is Gaussian noise with variance $( \sigma_n^2 )$.

2. Acquisition Function

The acquisition function determines the next point to evaluate by balancing exploration (trying new regions) and exploitation (focusing on promising regions). Common acquisition functions include:

Expected Improvement (EI)

The Expected Improvement at a point $( x )$ is given by:

$[ EI(x) = \mathbb{E}[\max(0, f(x) – f(x^+))] ]$

where $( f(x^+) )$ is the current best observed value.

Upper Confidence Bound (UCB)

The Upper Confidence Bound at a point $( x )$ is given by:

$[ UCB(x) = \mu(x) + \kappa \sigma(x) ]$

where $( \mu(x) )$ and $( \sigma(x) )$ are the mean and standard deviation predicted by the Gaussian Process, and $( \kappa )$ is a parameter balancing exploration and exploitation.

3. Evaluate and Update

After selecting $( x_{new} )$ by optimizing the acquisition function, we evaluate the true objective function $( y_{new} = f(x_{new}) )$. We then update the Gaussian Process with this new data point, improving the model’s accuracy.

4. Iterate

This process of selecting new points and updating the surrogate model is repeated. The Bayesian optimization algorithm continues to balance exploration and exploitation, efficiently searching the hyperparameter space for the optimal set.

Simple explanation of Bayesian optimization

Let’s break down Bayesian Optimization for hyperparameter search into simpler terms with a step-by-step approach.

First, let’s understand the purpose of Surrogate models.

Imagine we have a map, but it’s mostly blank. We have only a few points where we’ve been and know the terrain (like how well certain hyperparameters work). The surrogate model is our way of filling in the gaps on the map. It guesses what the terrain looks like in the blank spaces based on what we know.

In Bayesian Optimization, we often use something called a Gaussian Process for this guessing. It gives us two things for any set of hyperparameters:
– A guess of how well these hyperparameters will work (mean).
– How sure we are about this guess (variance).

Acquisition Function

Next, we need to decide where to go next on our map. The acquisition function helps us do this by balancing two things:

Exploration: Trying new areas we haven’t checked yet.

Exploitation: Focusing on areas we know are good.

Some common acquisition functions include:

Expected Improvement (EI): It chooses the next point where we expect the most improvement over our current best result.
Upper Confidence Bound (UCB): It looks for points that have a good balance of a high guess and high uncertainty, so we can learn more.

Once we pick a new point using the acquisition function, we try out those hyperparameters (train our model with them) and see how well they actually perform. We then add this new information to our surrogate model, updating our map with more accurate guesses and uncertainties.

We repeat the process:
1. Use the surrogate model to guess the terrain.
2. Use the acquisition function to pick a new point to try.
3. Try the new hyperparameters and update our guesses.

We keep doing this until we find the best hyperparameters or run out of time/computation budget.

Python Code Implementation using GPyOpt

Install packages

pip install scikit-learn gpyopt matplotlib

Import libraries

import numpy as np
import GPy
import GPyOpt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

Load the dataset

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

Define the objective (score) function

def objective_function(parameters):
    # Convert the hyperparameters to integers
    parameters = parameters[0]
    n_estimators = int(parameters[0])
    max_depth = int(parameters[1])

    # Define the classifier with the specified hyperparameters
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)

    # Calculate cross-validation scores (accuracy)
    scores = cross_val_score(clf, X, y, cv=5)

    # Return the average accuracy (to be maximized)
    return -np.mean(scores)  # Minimize negative accuracy -> maximize accuracy

Define the bounds for the hyper parameters

# Bounds for hyperparameters (n_estimators and max_depth)
bounds = [
    {'name': 'n_estimators', 'type': 'discrete', 'domain': (10, 50, 100, 150)},
    {'name': 'max_depth', 'type': 'discrete', 'domain': (2, 5, 10, 20)},
]

Perform optimization and print results

# Initialize Bayesian Optimization
optimizer = GPyOpt.methods.BayesianOptimization(f=objective_function, domain=bounds, acquisition_type='EI')

# Run the optimization
optimizer.run_optimization(max_iter=10)  # Specify the number of iterations

# Get the best hyperparameters found
best_params = optimizer.X[np.argmin(optimizer.Y)]

# Convert to integer values
best_n_estimators = int(best_params[0])
best_max_depth = int(best_params[1])

print("Best n_estimators:", best_n_estimators)
print("Best max_depth:", best_max_depth)
print("Best accuracy found:", -optimizer.fx_opt)  # Convert back from negative accuracy

Train the best model and get results

# Train the final model with the best hyperparameters
best_model = RandomForestClassifier(n_estimators=best_n_estimators, max_depth=best_max_depth, random_state=0)
best_model.fit(X, y)

# Split the data into training and testing sets for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model on the training set
best_model.fit(X_train, y_train)

# Predict on the testing set
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy of the best model:", accuracy)

Machine Learning

Complete Data Science and AI Roadmap by ML+

Sep 01, 2024

Machine Learning

Mutual information vs Cross Entropy

Aug 15, 2024

Machine Learning