gpp_model_selection

Contents:

gpp_model_selection.hpp

Table of Contents:

  1. FILE OVERVIEW
  2. MODEL SELECTION OVERVIEW
  3. LOG LIKELIHOOD METRICS OF MODEL QUALITY
    1. LOG MARGINAL LIKELIHOOD (LML)
    2. LEAVE ONE OUT CROSS VALIDATION (LOO-CV)
  4. HYPERPARAMETER OPTIMIZATION OF LOG LIKELIHOOD
  5. IMPLEMENTATION NOTES

1. FILE OVERVIEW

As a preface, you should read gpp_math.hpp’s comments first (if not also gpp_math.cpp) to get an overview of Gaussian Processes (GPs) and how we are using them (Expected Improvement, EI).

Note

These comments have been copied into interfaces/log_likelihood_interface.py (file comments) and cpp_wrappers/log_likelihood.py (LogMarginalLikelihood and LeaveOneOutLogLikelihood class comments).

This file deals with model selection via hyperparameter optimization, as the name implies. In our discussion of GPs, we did not pay much attention to the underlying covariance function. We noted that the covariance is extremely important since it encodes our assumptions about the objective function f(x) that we are trying to learn; i.e., the covariance describes the nearness/similarity of points in the input space. Also, the GP was clearly indicated to be a function of the covariance, but we simply assumed that the selection of covariance was an already-solved problem (or even worse, arbitrary!).

To tackle the model selection problem, this file provides some classes that encapsulate log likelihood measures of model quality:

  • LogMarginalLikelihoodEvaluator
  • LeaveOneOutLogLikelihoodEvaluator

both of which follow the Evaluator/State “idiom” described in gpp_common.hpp.

For selecting the best hyperparameters, this file provides two multistart optimization wrappers for gradient descent and Newton, that maximize the previous log likelihood measures:

  • MultistartGradientDescentHyperparameterOptimization<LogLikelihoodEvaluator, Domain>()
  • MultistartNewtonHyperparameterOptimization<LogLikelihoodEvaluator, Domain>()

These functions are wrappers for templated code in gpp_optimization.hpp. The wrappers just set up inputs for use with the routines in gpp_optimization.hpp. These are the preferred endpoints for hyperparameter optimization.

If these derivative-based techniques fail, then we also have simple ‘dumb’ search fallbacks:

  • LatinHypercubeSearchHyperparameterOptimization<LogLikelihoodEvaluator>() (eval log-likelihood at random points)
  • EvaluateLogLikelihoodAtPointList<LogLikelihoodEvaluator, Domain>() (eval log-likelihood at specified points)

This file also provides single-start versions of each optimization technique:

  • RestartedGradientDescentHyperparameterOptimization<LogLikelihoodEvaluator, Domain>()
  • NewtonHyperparameterOptimization<LogLikelihoodEvaluator, Domain>()

Typically these will not be called directly.

To better understand model selection, let’s look at a common covariance used in our computation, square exponential:

cov(x_1, x_2) = \alpha * \exp(-0.5*r^2), where r = \sum_{i=1}^d (x_1_i - x_2_i)^2 / L_i^2.

Here, \alpha is \sigma_f^2, the signal variance, and the L_i are length scales. The vector [\alpha, L_1, ... , L_d] are called the “hyperparameters” or “free parameters” (see gpp_covariance.hpp for more details). There is nothing in the covariance that guides the choice of the hyperparameters; L_1 = 0.001 is just as valid as L_1 = 1000.0.

Clearly, the value of the covariance changes substantially if L_i varies by a factor of two, much less 6 orders of magnitude. That is the difference between saying variations of size approx 1.0 in x_i, the first spatial dimension, are extremely important vs almost irrelevant.

So how do we know what hyperparameters to choose? This question/problem is more generally called “Model Selection.” Although the problem is far from solved, we will present the approaches implemented here; as usual, we use Rasmussen & Williams (Chapter 5 now) as a guide/reference.

However, we will not spend much time discussing selection across different classes of covariance functions; e.g., Square Exponential vs Matern w/various \nu, etc. We have yet to develop any experience/intuition with this problem and are temporarily punting it. For now, we follow the observation in Rasmussen & Williams that Square Exponential is a popular choice and appears to work very well. (This is still a very important problem; e.g., there may be scenarios when we would prefer a non-stationary or periodic covariance, and the methods discussed here do not cover this aspect of selection. Such covariance options are not yet implemented though.)

We do note that the techniques for selecting covariance classes more or less require hyperparameter optimization on each individual covariance. The likely approach would be to produce the best fit (according to chosen metrics) using each type of covariance (using optimization) and then choose the best performer across the group.

Following R&W, the following discussion:

  1. Provides some more concerete description/overview of the model selection problem
  2. Discusses the two metrics for model evaluation that we currently have
  3. Discusses the currently implemented optimization techniques for performing hyperparameter optimization

2. MODEL SELECTION OVERVIEW

Generally speaking, there are a great many tunable parameters in any model-based learning algorithm. In our case, the GP takes a covariance function as input; the selection of the covariance class as well as the choice of hyperparameters are all part of the model selection process. Determining these details of the [GP] model is the model selection problem.

In order to evaluate the quality of models (and solve model selction), we need some kind of metric. The literature suggests too many to cite, but R&W groups them into three common approaches (5.1, p108):

  1. compute the probability of the model given the data (e.g., LML)
  2. estimate the genereralization error (e.g., LOO-CV)
  3. bound the generalization error

where “generalization error” is defined as “the average error on unseen test examples (from the same distribution as the training cases).” So it’s a measure of how well or poorly the model predicts reality. This idea will be more clear in the LOO-CV discussion. Let’s dive into that next.

3. LOG LIKELIHOOD METRICS OF MODEL QUALITY

3a. LOG MARGINAL LIKELIHOOD (LML)

(Rasmussen & Williams, 5.4.1) The Log Marginal Likelihood measure comes from the ideas of Bayesian model selection, which use Bayesian inference to predict distributions over models and their parameters. The cpp file comments explore this idea in more depth. For now, we will simply state the relevant result. We can build up the notion of the “marginal likelihood”: probability(observed data GIVEN sampling points (X), model hyperparameters, model class (regression, GP, etc.)), which is denoted: p(y | X, \theta, H_i) (see the cpp file comments for more).

So the marginal likelihood deals with computing the probability that the observed data was generated from (another way: is easily explainable by) the given model.

The marginal likelihood is in part paramaterized by the model’s hyperparameters; e.g., as mentioned above. Thus we can search for the set of hyperparameters that produces the best marginal likelihood and use them in our model. Additionally, a nice property of the marginal likelihood optimization is that it automatically trades off between model complexity and data fit, producing a model that is reasonably simple while still explaining the data reasonably well. See the cpp file comments for more discussion of how/why this works.

In general, we do not want a model with perfect fit and high complexity, since this implies overfit to input noise. We also do not want a model with very low complexity and poor data fit: here we are washing the signal out with (assumed) noise, so the model is simple but it provides no insight on the data.

This is not magic. Using GPs as an example, if the covariance function is completely mis-specified, we can blindly go through with marginal likelihood optimization, obtain an “optimal” set of hyperparameters, and proceed... never realizing that our fundamental assumptions are wrong. So care is always needed.

3b. LEAVE ONE OUT CROSS VALIDATION (LOO-CV)

(Rasmussen & Williams, Chp 5.4.2) In cross validation, we split the training data, X, into two sets–a sub-training set and a validation set. Then we train a model on the sub-training set and test it on the validation set. Since the validation set comes from the original training data, we can compute the error. In effect we are examining how well the model explains itself.

Leave One Out CV works by considering n different validation sets, one at a time. Each point of X takes a turn being the sole member of the validation set. Then for each validation set, we compute a log pseudo-likelihood, measuring how probable that validation set is given the remaining training data and model hyperparameters.

Again, we can maximize this quanitity over hyperparameters to help us choose the “right” set for the GP.

4. HYPERPARAMETER OPTIMIZATION OF LOG LIKELIHOOD

Now that we have discussed the Log Marginal Likelihood and Leave One Out Cross Validation log pseudo-likelihood measures of model quality, what do we do with them? How do they help us choose hyperparameters?

From here, we can apply anyone’s favorite optimization technique to maximize log likelihoods wrt hyperparameters. The hyperparameters that maximize log likelihood provide the model configuration that is most likely to have produced the data observed so far, (X, f).

In principle, this approach always works. But in practice it is often not that simple. For example, suppose the underlying objective is periodic and we try to optimize hyperparameters for a class of covariance functions that cannot account for the periodicity. We can always* find the set of hyperparameters that maximize our chosen log likelihood measure (LML or LOO-CV), but if the covariance is mis-specified or we otherwise make invalid assumptions about the objective function, then the results are not meaningful at best and misleading at worst. It becomes a case of garbage in, garbage out.

* Even this is tricky. Log likelihood is almost never a convex function. For example, with LML + GPs, you often expect at least two optima, one more complex solution (short length scales, less intrinsic noise) and one less complex solution (longer length scales, higher intrinsic noise). There are even cases where no optima (to machine precision) exist or cases where solutions lie on (lower-dimensional) manifold(s) (e.g., locally the likelihood is (nearly) independent of one or more hyperparameters).

5. IMPLEMENTATION NOTES

  1. This file has a few primary endpoints for model selection (aka hyperparameter optimization):

    1. LatinHypercubeSearchHyperparameterOptimization<>():

      Takes in a log_likelihood_evaluator describing the prior, covariance, domain, config, etc.; searches over a set of (random) hyperparameters and outputs the set producing the best model fit.

    2. MultistartGradientDescentHyperparameterOptimization<>():

      Takes in a log_likelihood_evaluator describing the prior, covariance, domain, config, etc.; searches for the best hyperparameters (of covariance) using multiple gradient descent runs.

      Single start version available in: RestartedGradientDescentHyperparameterOptimization<>().

    3. MultistartNewtonHyperparameterOptimization<>() (Recommended):

      Takes in a log_likelihood_evaluator describing the prior, covariance, domain, config, etc.; searches for the best hyperparameters (of covariance) using multiple Newton runs.

      Single start version available in: NewtonHyperparameterOptimization<>().

    Note

    See gpp_model_selection.cpp‘s header comments for more detailed implementation notes.

    There are also several other functions with external linkage in this header; these are provided primarily to ease testing and to permit lower level access from python.

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.

Enums

LogLikelihoodTypes enum

Enum for the various log likelihood measures. Convenient for specifying which log likelihood to use in testing and also used by the Python interface to specify which log likelihood measure to optimize in hyperparameter optimization.

Values:

Functions

OL_NONNULL_POINTERS void ConvertFromLogToLinearDomainAndBuildInitialGuesses(int num_hyperparameters, int num_multistarts, UniformRandomGenerator * uniform_generator, std::vector< ClosedInterval > *restrict domain_bounds, std::vector< double > *restrict initial_guesses)

Converts domain_bounds input from log10-space to linear-space. Uniformly samples num_multistarts initial guesses from the log10-space domain and converts them all to linear space.

This is a utility function just for reducing code duplication.

Note

the domain here must be specified in LOG-10 SPACE!

Parameters:
num_hyperparameters:
 dimension of the domain
num_multistarts:
 number of random points to draw
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
domain_bounds[1]:
 std::vector<ClosedInterval> with >= num_hyperparameters elements specifying the boundaries of a n_hyper-dimensional tensor-product domain Specify in LOG-10 SPACE!
initial_guesses[1]:
 std::vector<double> with >= num_hyperparameters*num_multistarts elements. will be overwritten; ordered data[num_hyperparameters][num_multistarts]
Outputs:
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
domain_bounds[1]:
 overwritten with the domain bounds in linear space
initial_guesses[1]:
 overwritten with num_multistarts points sampled uniformly from the log10-space domain

template < typename LogLikelihoodEvaluator >
OL_NONNULL_POINTERS void SetupLogLikelihoodState(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, int max_num_threads, std::vector< typename LogLikelihoodEvaluator::StateType > * state_vector)

Set up state vector.

This is a utility function just for reducing code duplication.

Parameters:
log_likelihood_evaluator:
 evaluator object associated w/the state objects being constructed
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
max_num_threads:
 maximum number of threads for use by OpenMP (generally should be <= # cores)
state_vector[arbitrary]:
 vector of state objects, arbitrary size (usually 0)
Outputs:
state_vector[max_num_threads]:
 vector of states containing max_num_threads properly initialized state objects

template < typename LogLikelihoodEvaluator >
OL_NONNULL_POINTERS void InitializeBestKnownPoint(const LogLikelihoodEvaluator & log_likelihood_evaluator, double const *restrict initial_guesses, int num_hyperparameters, int num_multistarts, typename LogLikelihoodEvaluator::StateType * log_likelihood_state, OptimizationIOContainer * io_container)

Select a valid (point, value) pair to represent the current best known objective value.

This is a utility function just for reducing code duplication.

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of log likelihood
initial_guesses[num_hyperparameters][num_multistarts]:
 list of hyperparameters at which to compute log likelihood
num_hyperparameters:
 dimension of the domain
num_multistarts:
 number of random points to draw
log_likelihood_state[1]:
 properly constructed/configured LogLikelihoodEvaluator::State object
io_container[1]:
 properly constructed OptimizationIOContainer object
Outputs:
log_likelihood_state[1]:
 internal states of state object may be modified
io_container[1]:
 OptimizationIOContainer with its best_objective_value and best_point fields set (according to check_all_points flag)

template < typename LogLikelihoodEvaluator, typename DomainType >
OL_NONNULL_POINTERS void RestartedGradientDescentHyperparameterOptimization(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, const GradientDescentParameters & gd_parameters, const DomainType & domain, double *restrict next_hyperparameters)

Optimize a log likelihood measure of model fit (as a function of the hyperparameters of a covariance function) using the prior (i.e., sampled points, values). Optimization is done using restarted Gradient Descent, via GradientDescentOptimizer<...>::Optimize() from gpp_optimization.hpp. Please see that file for details on gradient descent and see gpp_optimizer_parameters.hpp for the meanings of the GradientDescentParameters.

This function is just a simple wrapper that sets up the Evaluator’s State and calls a general template for restarted GD.

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline implying that this function should be backed by multistarting on a grid (or similar) to provide better chances of a good initial guess.

The ‘dumb’ search component is provided through MultistartGradientDescentHyperparameterOptimization<...>(...) (see below). Generally, calling that function should be preferred. This function is meant for:

  1. easier testing
  2. if you really know what you’re doing

Solution is guaranteed to lie within the region specified by “domain”; note that this may not be a true optima (i.e., the gradient may be substantially nonzero).

Let n_hyper = covariance_ptr->GetNumberOfHyperparameters();

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of log likelihood and its gradient
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data covariance_ptr->GetCurrentHyperparameters() will be used to obtain the initial guess
gd_parameters:GradientDescentParameters object that describes the parameters controlling hyperparameter optimization (e.g., number of iterations, tolerances, learning rate)
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
Outputs:
next_hyperparameters[n_hyper]:
 the new hyperparameters found by gradient descent

template < typename LogLikelihoodEvaluator >
OL_NONNULL_POINTERS void MultistartGradientDescentHyperparameterOptimization(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, const GradientDescentParameters & gd_parameters, ClosedInterval const *restrict domain, const ThreadSchedule & thread_schedule, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, double *restrict next_hyperparameters)

Function to add multistarting on top of (restarted) gradient descent hyperparameter optimization. Generates num_multistarts initial guesses (random sampling from domain), all within the specified domain, and kicks off an optimization run from each guess.

Same idea as ComputeOptimalPointsToSampleWithRandomStarts() in gpp_math.hpp, which is for optimizing Expected Improvement; see those docs for additional gradient descent details. This is the primary endpoint for hyperparameter optimization using gradient descent. It constructs the required state objects, builds a GradientDescentOptimizer object, and wraps a series of calls:

  • The heart of multistarting is in MultistartOptimizer<...>::MultistartOptimize(...) (in gpp_optimization.hpp).
    • The heart of restarted GD is in GradientDescentOptimizer<...>::Optimize() (in gpp_optimization.hpp).
    • Log likelihood is computed in ComputeLogLikelihood() and its gradient in ComputeGradLogLikelihood(), which must be member functions of the LogLikelihoodEvaluator template parameter.

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline for sizing the domain and gd_parameters.num_multistarts; i.e., be wary of sets of initial guesses that cover the space too sparsely.

Note

the domain here must be specified in LOG-10 SPACE!

Solution is guaranteed to lie within the region specified by “domain”; note that this may not be a true optima (i.e., the gradient may be substantially nonzero).

Warning

this function fails if NO improvement can be found! In that case, best_next_point will always be the first randomly chosen point. found_flag will be set to false in this case.

Note

the domain here must be specified in LOG-10 SPACE!

Let n_hyper = covariance_ptr->GetNumberOfHyperparameters();

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of gradient + hessian of log likelihood
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
gd_parameters:GradientDescentParameters object that describes the parameters controlling hyperparameter optimization (e.g., number of iterations, tolerances, learning rate)
domain[n_hyper]:
 array of ClosedInterval specifying the boundaries of a n_hyper-dimensional tensor-product domain. Specify in LOG-10 SPACE!
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_dynamic), chunk_size (0).
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
Outputs:
found_flag[1]:true if next_hyperparameters corresponds to a converged solution
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
next_hyperparameters[n_hyper]:
 the new hyperparameters found by gradient descent

template < typename LogLikelihoodEvaluator, typename DomainType >
OL_NONNULL_POINTERS OL_WARN_UNUSED_RESULT int NewtonHyperparameterOptimization(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, const NewtonParameters & newton_parameters, const DomainType & domain, double *restrict next_hyperparameters)

Optimize a log likelihood measure of model fit (as a function of the hyperparameters of a covariance function) using the prior (i.e., sampled points, values). Optimization is done using Newton’s method for optimization, via NewtonOptimization() from gpp_optimization.hpp. Please see that file for details on Newton and see gpp_optimizer_parameters.hpp for the meanings of the NewtonParameters.

This function is just a simple wrapper that sets up the Evaluator’s State and calls a general template for Newton, NewtonOptimization<...>(...) (in gpp_optimization.hpp).

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline implying that this function should be backed by multistarting on a grid (or similar) to provide better chances of a good initial guess.

The ‘dumb’ search component is provided through MultistartNewtonHyperparameterOptimization<...>(...) (see below). Generally, calling that function should be preferred. This is meant for:

  1. easier testing
  2. if you really know what you’re doing
gamma = 1.01, time_factor = 1.0e-3 should lead to good robustness at reasonable speed. This should be a fairly safe default.
gamma = 1.05, time_factor = 1.0e-1 will be several times faster but not as robust.

Let n_hyper = covariance.GetNumberOfHyperparameters();

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of gradient + hessian of log likelihood
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data covariance.GetCurrentHyperparameters() will be used to obtain the initial guess
newton_parameters:
 NewtonParameters object that describes the parameters controlling hyperparameter optimization (e.g., number of iterations, tolerances, diagonal dominance)
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
Outputs:
next_hyperparameters[n_hyper]:
 the new hyperparameters found by newton

template < typename LogLikelihoodEvaluator >
OL_NONNULL_POINTERS void MultistartNewtonHyperparameterOptimization(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, const NewtonParameters & newton_parameters, ClosedInterval const *restrict domain, const ThreadSchedule & thread_schedule, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, double *restrict next_hyperparameters)

Function to add multistarting on top of newton hyperparameter optimization. Generates num_multistarts initial guesses (random sampling from domain), all within the specified domain, and kicks off an optimization run from each guess.

Same idea as ComputeOptimalPointsToSampleWithRandomStarts() in gpp_math.hpp, which is for optimizing Expected Improvement. This is the primary endpoint for hyperparameter optimization using Newton’s method. It constructs the required state objects, builds a NewtonOptimizer object, and wraps a series of calls:

  • The heart of multistarting is in MultistartOptimizer<...>::MultistartOptimize<...>(...) (in gpp_optimization.hpp).
    • The heart of Newton is in NewtonOptimization() (in gpp_optimization.hpp).
    • Log likelihood is computed in ComputeLogLikelihood(), its gradient in ComputeGradLogLikelihood(), and its hessian in
    • ComputeHessianLogLikelihood(), which must be member functions of the LogLikelihoodEvaluator template parameter.

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline for sizing the domain and gd_parameters.num_multistarts; i.e., be wary of sets of initial guesses that cover the space too sparsely.

Solution is guaranteed to lie within the region specified by “domain”; note that this may not be a true optima (i.e., the gradient may be substantially nonzero).

Warning

this function fails if NO improvement can be found! In that case, best_next_point will always be the first randomly chosen point. found_flag will be set to false in this case.

Note

the domain here must be specified in LOG-10 SPACE!

Let n_hyper = covariance.GetNumberOfHyperparameters();

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of gradient + hessian of log likelihood
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
newton_parameters:
 NewtonParameters object that describes the parameters controlling hyperparameter optimization (e.g., number of iterations, tolerances, diagonal dominance)
domain[n_hyper]:
 array of ClosedInterval specifying the boundaries of a n_hyper-dimensional tensor-product domain. Specify in LOG-10 SPACE!
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_dynamic), chunk_size (0).
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
Outputs:
found_flag[1]:true if next_hyperparameters corresponds to a converged solution
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
next_hyperparameters[n_hyper]:
 the new hyperparameters found by newton

template < typename LogLikelihoodEvaluator, typename DomainType >
void EvaluateLogLikelihoodAtPointList(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, const DomainType & domain_linearspace, const ThreadSchedule & thread_schedule, double const *restrict initial_guesses, int num_multistarts, bool *restrict found_flag, double *restrict function_values, double *restrict next_hyperparameters)

Function to evaluate various log likelihood measures over a specified list of num_multistarts hyperparameters. Optionally outputs the log likelihood at each of these hyperparameters. Outputs the hyperparameters of the input set obtaining the maximum log likelihood value.

Generally gradient descent is preferred but when they fail to converge this may be the only “robust” option. This function is also useful for plotting or debugging purposes (just to get a bunch of log likelihood values).

This function is just a wrapper that builds the required state objects and a NullOptimizer object and calls MultistartOptimizer<...>::MultistartOptimize(...); see gpp_optimization.hpp.

Let n_hyper = covariance.GetNumberOfHyperparameters();

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of log likelihood
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
domain:object specifying the domain to optimize over (see gpp_domain.hpp), specify in LINEAR SPACE!
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_guided), chunk_size (0).
initial_guesses[n_hyper][num_multistarts]:
 list of hyperparameters at which to compute log likelihood
num_multistarts:
 number of random points to generate for use as initial guesses
Outputs:
found_flag[1]:true if next_hyperparameters corresponds to a finite log likelihood
function_values[num_multistarts]:
 log likelihood evaluated at each point of initial_guesses, in the same order as initial_guesses; never dereferenced if nullptr
next_hyperparameters[n_hyper]:
 the new hyperparameters found by “dumb” search

template < typename LogLikelihoodEvaluator >
OL_NONNULL_POINTERS void LatinHypercubeSearchHyperparameterOptimization(const LogLikelihoodEvaluator & log_likelihood_evaluator, const CovarianceInterface & covariance, ClosedInterval const *restrict domain, const ThreadSchedule & thread_schedule, int num_multistarts, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, double *restrict next_hyperparameters)

Function to do a “dumb” search over num_multistarts points (generated on a Latin Hypercube) for the optimal set of hyperparameters (largest log likelihood).

Generally gradient descent or newton are preferred but when they fail to converge this may be the only “robust” option.

This function wraps EvaluateLogLikelihoodAtPointList(), providing it with a uniformly sampled (on a latin hypercube) set of hyperparameters at which to evaluate log likelihood.

Solution is guaranteed to lie within the region specified by “domain”; note that this may not be a true optima (i.e., the gradient may be substantially nonzero).

Note

the domain here must be specified in LOG-10 SPACE!

Let n_hyper = covariance.GetNumberOfHyperparameters();

Parameters:
log_likelihood_evaluator:
 object supporting evaluation of log likelihood
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
domain[n_hyper]:
 array of ClosedInterval specifying the boundaries of a n_hyper-dimensional tensor-product domain. Specify in LOG-10 SPACE!
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_guided), chunk_size (0).
num_multistarts:
 number of random points to generate for use as initial guesses
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
Outputs:
found_flag[1]:true if next_hyperparameters corresponds to a finite log likelihood
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
next_hyperparameters[n_hyper]:
 the new hyperparameters found by “dumb” search

class LeaveOneOutLogLikelihoodEvaluator

This serves as a quick summary of Leave One Out Cross Validation (LOO-CV). Please see the file comments here and in the corresponding .cpp file for further details.

Class for computing the Leave-One-Out Cross Validation (LOO-CV) Log Pseudo-Likelihood. Given a particular covariance function (including hyperparameters) and training data ((point, function value, measurement noise) tuples), the log LOO-CV pseudo-likelihood expresses how well the model explains itself.

That is, cross validation involves splitting the training set into a sub-training set and a validation set. Then we measure the log likelihood that a model built on the sub-training set could produce the values in the validation set.

Leave-One-Out CV does this process |y| times: on the i-th run, the sub-training set is (X, y) with the i-th point removed and the validation set is the i-th point. Then the predictive performance of each sub-model are aggregated into a psuedo-likelihood.

This quantity primarily deals with the internal consistency of the model–how well it explains itself. The LOO-CV likelihood gives an “estimate for the predictive probability, whether or not the assumptions of the model may be fulfilled.” It is a more frequentist view of model selection. (Rasmussen & Williams p118) See Rasmussen & Williams 5.3 and 5.4.2 for more details.

As with the log marginal likelihood, we can use this quantity to measure the performance of our model. We can also maximize it (via hyperparameter modifications or covariance function changes) to improve model performance. It has also been argued that LOO-CV is better at detecting model mis-specification (e.g., wrong covariance function) than log marginal measures (Rasmussen & Williams p118).

Note

These class comments are duplicated in Python: cpp_wrappers.log_likelihood.LeaveOneOutLogLikelihood

Public Type
Public Functions

LeaveOneOutLogLikelihoodEvaluator(double const *restrict points_sampled_in, double const *restrict points_sampled_value_in, double const *restrict noise_variance_in, int dim_in, int num_sampled_in)

Constructs a LeaveOneOutLogLikelihoodEvaluator object. All inputs are required; no default constructor nor copy/assignment are allowed.

Parameters:
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
points_sampled[dim][num_sampled]:
 points that have already been sampled
points_sampled_value[num_sampled]:
 values of the already-sampled points
noise_variance[num_sampled]:
 the \sigma_n^2 (noise variance) associated w/observation, points_sampled_value
dim:the spatial dimension of a point (i.e., number of independent params in experiment)
num_sampled:number of already-sampled points

int dim()

int num_sampled()

double ComputeObjectiveFunction(StateType * log_likelihood_state)

Wrapper for ComputeLogLikelihood(); see that function for details.

void ComputeGradObjectiveFunction(StateType * log_likelihood_state, double *restrict grad_loo)

Wrapper for ComputeGradLogLikelihood(); see that function for details.

void ComputeHessianObjectiveFunction(StateType * log_likelihood_state, double *restrict hessian_loo)

Wrapper for ComputeHessianLogLikelihood(); see that function for details.

void FillLogLikelihoodState(StateType * log_likelihood_state)

Sets up the LeaveOneOutLogLikelihoodState object so that it can be used to compute log marginal and its gradients. ASSUMES all needed space is ALREADY ALLOCATED.

This function should not be called directly; instead use LeaveOneOutLogLikelihoodState::SetupState.

Parameters:
log_likelihood_state[1]:
 constructed state object with appropriate sized allocations
Outputs:
log_likelihood_state[1]:
 fully configured state object, ready for use by this class’s member functions

double ComputeLogLikelihood(const StateType & log_likelihood_state)

Computes the log LOO-CV pseudo-likelihood That is, split the training data (X, y) into |y| training set groups, where in the i-th group, the validation set is the i-th point of (X, y) and the training set is (X, y) with the i-th point removed. Then this likelihood measures the aggregate performance of the ability of a model built on each “leave one out” training set to predict the corresponding validation set. So in some sense it is a measure of model consitency, ensuring that we do not perform well on a few points while doing horribly on the others.

Parameters:
log_likelihood_state:
 properly configured state oboject
Returns:
natural log of the leave one out cross validation pseudo-likelihood of the GP model

Computes the Leave-One-Out Cross Validation log pseudo-likelihood.

Let \log p(y_i | X_{-i}, y_{-i}, \theta) = -0.5\log(\sigma_i^2) - 0.5*(y_i - \mu_i)^2/\sigma_i^2 - 0.5\log(2\pi)

Then we compute:

L_{LOO}(X, y, \theta) = \sum_{i = 1}^n \log p(y_i | X_{-i}, y_{-i}.

where X_{-i} and y_{-i} are the training data with the i-th point removed. Then X_i is taken as the point to sample. \sigma_i^2 and \mu_i are the GP (predicted) variance/mean at the point to sample.

This function currently uses LeaveOneOutCoreWithMatrixInverse() to compute \sigma_i^2 and \mu_i, which is potentially ill-conditioned. This has not proven to be an issue in testing, but _accurate() is preferred when loss of prescision is suspected.

See Rasmussen & Williams 5.4.2 for more details.

void ComputeGradLogLikelihood(StateType * log_likelihood_state, double *restrict grad_loo)

Computes the (partial) derivatives of the leave-one-out cross validation log pseudo-likelihood with respect to each hyperparameter of our covariance function.

Let n_hyper = covariance_ptr->GetNumberOfHyperparameters();

Parameters:
log_likelihood_state[1]:
 properly configured state oboject
Outputs:
log_likelihood_state[1]:
 state with temporary storage modified
grad_loo[n_hyper]:
 gradient of leave one out cross validation log likelihood wrt each hyperparameter of covariance

Computes the gradients (wrt hyperparameters, \theta) of the Leave-One-Out Cross Validation log pseudo-likelihood, L_{LOO}.

See function definition get_leave_one_out_likelihood() function defn docs for definition of L_{LOO}. We compute:

\pderiv{L_{LOO}}{\theta_j} = \sum_{i = 1}^n \frac{1}{(K^-1)_ii} *
             \left(\alpha_i[Z_j\alpha]_i - 0.5(1 + \frac{\alpha_i^2}{(K^-1)_ii})[Z_j K^-1]_ii \right)

where \alpha = K^-1 * y, and Z_j = K^-1 * \pderiv{K}{\theta_j}.

Note that formation of [Z_j * K^-1] = K^-1 * \pderiv{K}{\theta_j} * K^-1 requires some care. We prefer not to use the explicit inverse whenever possible. But we are readily able to compute A^-1 * B via “backsolve” (of a factored A), so we do:

A := K^-1 * Z^T
result := A^T gives us the desired [Z_j * K^-1] without forming K^-1.

Note that this formulation uses (K^-1)_ii directly to avoid the (very high) expense of evaluating the GP mean, variance n times. This is analogous to using LeaveOneOutCoreWithMatrixInverse() over LeaveOneOutCoreAccurate(), which is an assumption that seems reasonable now but may need revisiting later.

See Rasmussen & Williams 5.4.2 for details.

void ComputeHessianLogLikelihood(StateType * log_likelihood_state, double *restrict hessian_loo)

NOT IMPLEMENTED. Kludge to make it so that I can instantiate MultistartNewtonOptimization<> with LeaveOneOutLogLikelihoodEvaluator in gpp_python.cpp. It is an error to select NewtonOptimization with LeaveOneOutLogLikelihoodEvaluator, but I can’t find a nicer way to generate this error while still being able to treat MultistartNewtonOptimization<> generically.

NOT IMPLEMENTED Kludge to make it so that I can use general template code w/o special casing LeaveOneOutLogLikelihoodEvaluator.

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(LeaveOneOutLogLikelihoodEvaluator)

Public Static Attributes

constexpr char const * kName

string name of this log likelihood evaluator for logging

Private Functions

void BuildHyperparameterGradCovarianceMatrix(StateType * log_likelihood_state)

Constructs the tensor of gradients (wrt hyperparameters) of the covariance function at all pairs of points_sampled_.

The result is stored in state->grad_hyperparameter_cov_matrix. So we are computing \pderiv{cov(X_i, X_j)}{\theta_k}. These data are ordered as: grad_hyperparameter_cov_matrix[i][j][k] (i.e., num_hyperparmeters matrices of size Square(num_sampled_)).

Note

grad_hyperparameter_cov_matrix[i][j][k] == grad_hyperparameter_cov_matrix[j][i][k]

Parameters:
log_likelihood_state[1]:
 properly configured state object
Outputs:
log_likelihood_state[1]:
 state with grad_hyperparameter_cov_matrix filled

Private Members

const int dim_

spatial dimension (e.g., entries per point of points_sampled)

int num_sampled_

number of points in points_sampled

std::vector< double > points_sampled_

coordinates of already-sampled points, X

std::vector< double > points_sampled_value_

function values at points_sampled, y

std::vector< double > noise_variance_

\sigma_n^2, the noise variance

class LeaveOneOutLogLikelihoodState

State object for LeaveOneOutLogLikelihoodEvaluator. This object tracks the covariance object as well as derived quantities that (along with the training points/values in the Evaluator class) fully specify the log marginal likelihood. Since this is used to optimize the log marginal likelihood, the covariance’s hyperparameters are variable.

See general comments on State structs in gpp_common.hpp’s header docs.

Public Type

typedef LeaveOneOutLogLikelihoodEvaluator EvaluatorType

Public Functions

LeaveOneOutLogLikelihoodState(const EvaluatorType & log_likelihood_eval, const CovarianceInterface & covariance_in)

Constructs a LeaveOneOutLogLikelihoodState object with a specified covariance object (in particular, new hyperparameters). Ensures all state variables & temporaries are properly sized. Properly sets all state variables so that the Evaluator can be used to compute log marginal likelihood, gradients, etc.

Warning

This object’s state is INVALIDATED if the log_likelihood_eval used in construction is mutated! SetupState() should be called again in such a situation.

Parameters:
log_likelihood_eval:
 LogMarginalLikelihoodEvaluator object that this state is being used with
covariance_in:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data

LeaveOneOutLogLikelihoodState(LeaveOneOutLogLikelihoodState && other)

int GetProblemSize()

void SetCurrentPoint(const EvaluatorType & log_likelihood_eval, double const *restrict hyperparameters)

void GetCurrentPoint(double *restrict hyperparameters)

void GetHyperparameters(double *restrict hyperparameters)

Get hyperparameters of underlying covariance function.

Outputs:
hyperparameters[num_hyperparameters]:
 covariance’s hyperparameters

void SetHyperparameters(const EvaluatorType & log_likelihood_eval, double const *restrict hyperparameters)

Change the hyperparameters of the underlying covariance function. Update the state’s derived quantities to be consistent with the new hyperparameters.

Parameters:
log_likelihood_eval:
 LeaveOneOutLogLikelihoodEvaluator object that this state is being used with
hyperparameters[num_hyperparameters]:
 hyperparameters to change to

void SetupState(const EvaluatorType & log_likelihood_eval, double const *restrict hyperparameters)

Configures this state object with new hyperparameters. Ensures all state variables & temporaries are properly sized. Properly sets all state variables for log likelihood (+ gradient) evaluation.

Warning

This object’s state is INVALIDATED if the log_likelihood used in SetupState is mutated! SetupState() should be called again in such a situation.

Parameters:
log_likelihood_eval:
 log likelihood evaluator object that describes the training/already-measured data
hyperparameters[num_hyperparameters]:
 hyperparameters to change to

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(LeaveOneOutLogLikelihoodState)

Public Members

const int dim

spatial dimension (e.g., entries per point of points_sampled)

int num_sampled

number of points in points_sampled

int num_hyperparameters

number of hyperparameters of covariance; i.e., covariance_ptr->GetNumberOfHyperparameters()

std::unique_ptr< CovarianceInterface > covariance_ptr

covariance class (for computing covariance and its gradients)

std::vector< double > K_chol

cholesky factorization of K

std::vector< double > K_inv

K^-1

std::vector< double > K_inv_y

K^-1 * y; computed WITHOUT forming K^-1

std::vector< double > grad_hyperparameter_cov_matrix

\pderiv{K_{ij}}{\theta_k}; temporary b/c it is overwritten with each computation of GradLikelihood

std::vector< double > Z_alpha

temporary: K^-1 * grad_hyperparameter_cov_matrix * K^-1 * y

std::vector< double > Z_K_inv

temporary: K^-1 * grad_hyperparameter_cov_matrix * K^-1

class LogMarginalLikelihoodEvaluator

This serves as a quick summary of the Log Marginal Likelihood (LML). Please see the file comments here and in the corresponding .cpp file for further details.

Class for computing the Log Marginal Likelihood. Given a particular covariance function (including hyperparameters) and training data ((point, function value, measurement noise) tuples), the log marginal likelihood is the log probability that the data were observed from a Gaussian Process would have generated the observed function values at the given measurement points. So log marginal likelihood tells us “the probability of the observations given the assumptions of the model.” Log marginal sits well with the Bayesian Inference camp. (Rasmussen & Williams p118)

This quantity primarily deals with the trade-off between model fit and model complexity. Handling this trade-off is automatic in the log marginal likelihood calculation. See Rasmussen & Williams 5.2 and 5.4.1 for more details.

We can use the log marginal likelihood to determine how good our model is. Additionally, we can maximize it by varying hyperparameters (or even changing covariance functions) to improve our model quality. Hence this class provides access to functions for computing log marginal likelihood and its hyperparameter gradients.

Note

These class comments are duplicated in Python: cpp_wrappers.log_likelihood.LogMarginalLikelihood

Public Type

typedef LogMarginalLikelihoodState StateType

Public Functions

LogMarginalLikelihoodEvaluator(double const *restrict points_sampled_in, double const *restrict points_sampled_value_in, double const *restrict noise_variance_in, int dim_in, int num_sampled_in)

Constructs a LogMarginalLikelihoodEvaluator object. All inputs are required; no default constructor nor copy/assignment are allowed.

Parameters:
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
points_sampled[dim][num_sampled]:
 points that have already been sampled
points_sampled_value[num_sampled]:
 values of the already-sampled points
noise_variance[num_sampled]:
 the \sigma_n^2 (noise variance) associated w/observation, points_sampled_value
dim:the spatial dimension of a point (i.e., number of independent params in experiment)
num_sampled:number of already-sampled points

int dim()

int num_sampled()

double ComputeObjectiveFunction(StateType * log_likelihood_state)

Wrapper for ComputeLogLikelihood(); see that function for details.

void ComputeGradObjectiveFunction(StateType * log_likelihood_state, double *restrict grad_log_marginal)

Wrapper for ComputeGradLogLikelihood(); see that function for details.

void ComputeHessianObjectiveFunction(StateType * log_likelihood_state, double *restrict hessian_log_marginal)

Wrapper for ComputeHessianLogLikelihood(); see that function for details.

void FillLogLikelihoodState(StateType * log_likelihood_state)

Sets up the LogMarginalLikelihoodState object so that it can be used to compute log marginal and its gradients. ASSUMES all needed space is ALREADY ALLOCATED.

This function should not be called directly; instead use LogMarginalLikelihoodState::SetupState.

Parameters:
log_likelihood_state[1]:
 constructed state object with appropriate sized allocations
Outputs:
log_likelihood_state[1]:
 fully configured state object, ready for use by this class’s member functions

double ComputeLogLikelihood(const StateType & log_likelihood_state)

Computes the log marginal likelihood, log(p(y | X, \theta)). That is, the probability of observing the training values, y, given the training points, X, and hyperparameters (of the covariance function), \theta.

This is a measure of how likely it is that the observed values came from our Gaussian Process Prior.

Parameters:
log_likelihood_state:
 properly configured state oboject
Returns:
natural log of the marginal likelihood of the GP model

Note

These comments have been copied into the matching method of LogMarginalLikelihood in python_version/log_likelihood.py.

log p(y | X, \theta) = -\frac{1}{2} * y^T * K^-1 * y - \frac{1}{2} * \log(det(K)) - \frac{n}{2} * \log(2*pi)

where n is num_sampled, \theta are the hyperparameters, and \log is the natural logarithm. In the following,

  • term1 = -\frac{1}{2} * y^T * K^-1 * y
  • term2 = -\frac{1}{2} * \log(det(K))
  • term3 = -\frac{n}{2} * \log(2*pi)

For an SPD matrix K = L * L^T,

det(K) = \Pi_i L_ii^2

We could compute this directly and then take a logarithm. But we also know:

\log(det(K)) = 2 * \sum_i \log(L_ii)

The latter method is (currently) preferred for computing \log(det(K)) due to reduced chance for overflow and (possibly) better numerical conditioning.

void ComputeGradLogLikelihood(StateType * log_likelihood_state, double *restrict grad_log_marginal)

Computes the (partial) derivatives of the log marginal likelihood with respect to each hyperparameter of our covariance function.

Let n_hyper = covariance_ptr->GetNumberOfHyperparameters();

Parameters:
log_likelihood_state[1]:
 properly configured state oboject
Outputs:
log_likelihood_state[1]:
 state with temporary storage modified
grad_log_marginal[n_hyper]:
 gradient of log marginal likelihood wrt each hyperparameter of covariance

void ComputeHessianLogLikelihood(StateType * log_likelihood_state, double *restrict hessian_log_marginal)

Constructs the Hessian matrix of the log marginal likelihood function. This matrix is symmetric. It is also negative definite near maxima of the log marginal.

See HyperparameterHessianCovariance() docs in CovarianceInterface (gpp_covariance.hpp) for details on the structure of the Hessian matrix.

Parameters:
log_likelihood_state[1]:
 properly configured state oboject
Outputs:
log_likelihood_state[1]:
 state with temporary storage modified
hessian_log_marginal[n_hyper][n_hyper]:
 (i,j)-th entry is \mixpderiv{LML}{\theta_i}{\theta_j}, where LML = log(p(y | X, \theta))

Computes the Hessian matrix of the log (marginal) likelihood wrt the hyperparameters:

\mixpderiv{log(p(y | X, \theta_k))}{\theta_i}{\theta_j} =
    (-\alpha * \pderiv{K}{\theta_i} * K^-1 * \pderiv{K}{\theta_j} * \alpha)
  + (\alpha * \mixpderiv{K}{\theta_i}{\theta_j} * \alpha)
  - 0.5 * tr(-K^-1 * \pderiv{K}{\theta_i} * K^-1 * \pderiv{K}{\theta_j} + K^-1 * \mixpderiv{K}{\theta_i}{\theta_j})

Note that as usual, K is the covariance matrix (bearing its own two indices, say K_{k,l}) which are omitted here.

This expression arises from differentating each entry of the gradient (see function comments for LogMarginalLikelihoodEvaluator::ComputeGradLogLikelihood for expression) of the log marginal wrt each hyperparameter.

We use the identity: \pderiv{K^-1}{X} = -K^-1 * \pderiv{K}{X} * K^-1; as well as the fact that \partial tr(A) = tr(\partial A). That is, since trace is linear, the order can be interchanged with the differential operator; and the various symmetries of the gradient/hessians of K (see function declaration comments for details on symmetry).

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(LogMarginalLikelihoodEvaluator)

Public Static Attributes

constexpr char const * kName

string name of this log likelihood evaluator for logging

Private Functions

void BuildHyperparameterGradCovarianceMatrix(StateType * log_likelihood_state)

Constructs the tensor of gradients (wrt hyperparameters) of the covariance function at all pairs of points_sampled_.

The result is stored in state->grad_hyperparameter_cov_matrix. So we are computing \pderiv{cov(X_i, X_j)}{\theta_k}. These data are ordered as: grad_hyperparameter_cov_matrix[i][j][k] (i.e., num_hyperparmeters matrices of size Square(num_sampled_)).

Note

grad_hyperparameter_cov_matrix[i][j][k] == grad_hyperparameter_cov_matrix[j][i][k]

Parameters:
log_likelihood_state[1]:
 properly configured state object
Outputs:
log_likelihood_state[1]:
 state with grad_hyperparameter_cov_matrix filled

void BuildHyperparameterHessianCovarianceMatrix(StateType * log_likelihood_state, double * hessian_hyperparameter_cov_matrix)

Constructs the tensor of hessians (wrt hyperparameters) of the covariance function at all pairs of points_sampled_. The result is \mixpderiv{cov(X_i, X_j)}{\theta_k}{\theta_l}, stored in hessian_hyperparameter_cov_matrix[i][j][k][l].

Note

this tensor has several symmetries: A[i][j][k][l] == A[j][i][k][l] and A[i][j][k][l] == A[i][j][l][k].

Parameters:
log_likelihood_state[1]:
 properly configured state object
Outputs:
log_likelihood_state[1]:
 state object with temporary storage modified
hessian_hyperparameter_cov_matrix[num_sampled][num_sampled][n_hyper][n_hyper]:
 (i,j,k,l)-th entry is \mixpderiv{cov(X_i, X_j)}{\theta_k}{\theta_l}

Private Members

const int dim_

spatial dimension (e.g., entries per point of points_sampled)

int num_sampled_

number of points in points_sampled

std::vector< double > points_sampled_

coordinates of already-sampled points, X

std::vector< double > points_sampled_value_

function values at points_sampled, y

std::vector< double > noise_variance_

\sigma_n^2, the noise variance

class LogMarginalLikelihoodState

State object for LogMarginalLikelihoodEvaluator. This object tracks the covariance object as well as derived quantities that (along with the training points/values in the Evaluator class) fully specify the log marginal likelihood. Since this is used to optimize the log marginal likelihood, the covariance’s hyperparameters are variable.

See general comments on State structs in gpp_common.hpp’s header docs.

Public Type

typedef LogMarginalLikelihoodEvaluator EvaluatorType

Public Functions

LogMarginalLikelihoodState(const EvaluatorType & log_likelihood_eval, const CovarianceInterface & covariance_in)

Constructs a LogMarginalLikelihoodState object with a specified covariance object (in particular, new hyperparameters). Ensures all state variables & temporaries are properly sized. Properly sets all state variables so that the Evaluator can be used to compute log marginal likelihood, gradients, etc.

Warning

This object’s state is INVALIDATED if the log_likelihood_eval used in construction is mutated! SetupState() should be called again in such a situation.

Parameters:
log_likelihood_eval:
 LogMarginalLikelihoodEvaluator object that this state is being used with
covariance_in:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data

LogMarginalLikelihoodState(LogMarginalLikelihoodState && other)

int GetProblemSize()

void SetCurrentPoint(const EvaluatorType & log_likelihood_eval, double const *restrict hyperparameters)

void GetCurrentPoint(double *restrict hyperparameters)

void GetHyperparameters(double *restrict hyperparameters)

Get hyperparameters of underlying covariance function.

Outputs:
hyperparameters[num_hyperparameters]:
 covariance’s hyperparameters

void SetHyperparameters(const EvaluatorType & log_likelihood_eval, double const *restrict hyperparameters)

Change the hyperparameters of the underlying covariance function. Update the state’s derived quantities to be consistent with the new hyperparameters.

Parameters:
log_likelihood_eval:
 LogMarginalLikelihoodEvaluator object that this state is being used with
hyperparameters[num_hyperparameters]:
 hyperparameters to change to

void SetupState(const EvaluatorType & log_likelihood_eval, double const *restrict hyperparameters)

Configures this state object with new hyperparameters. Ensures all state variables & temporaries are properly sized. Properly sets all state variables for log likelihood (+ gradient) evaluation.

Warning

This object’s state is INVALIDATED if the log_likelihood used in SetupState is mutated! SetupState() should be called again in such a situation.

Parameters:
log_likelihood_eval:
 log likelihood evaluator object that describes the training/already-measured data
hyperparameters[num_hyperparameters]:
 hyperparameters to change to

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(LogMarginalLikelihoodState)

Public Members

const int dim

spatial dimension (e.g., entries per point of points_sampled)

int num_sampled

number of points in points_sampled

int num_hyperparameters

number of hyperparameters of covariance; i.e., covariance_ptr->GetNumberOfHyperparameters()

std::unique_ptr< CovarianceInterface > covariance_ptr

covariance class (for computing covariance and its gradients)

std::vector< double > K_chol

cholesky factorization of K

std::vector< double > K_inv_y

K^-1 * y; computed WITHOUT forming K^-1

std::vector< double > grad_hyperparameter_cov_matrix

\pderiv{K_{ij}}{\theta_k}; temporary b/c it is overwritten with each computation of GradLikelihood

std::vector< double > temp_vec

temporary storage space of size num_sampled

gpp_model_selection.cpp

Table of Contents:

  1. FILE OVERVIEW

  2. MATHEMATICAL OVERVIEW

    1. LOG LIKELIHOOD METRICS OF MODEL QUALITY

      1. BAYESIAN MODEL SELECTION
        1. LOG MARGINAL LIKELIHOOD (LML)
      2. CROSS VALIDATION (CV)
      1. LEAVE ONE OUT CROSS VALIDATION (LOO-CV)
      1. REMARKS
  3. CODE HIERARCHY

1. FILE OVERVIEW

As a preface, if you are not already familiar with GPs and their implementation, you should read the file comments for gpp_math.hpp/cpp first. If you are unfamiliar with the concept of model selection or optimization methods, please read the file comments for gpp_model_selection.hpp first.

This file provides implementations for various log likelihood measures of model quality (marginal likelihood, leave one out cross validation). The functions to optimize these measures all live in the header file (they are templated) and the hearts of the optimization routines are in gpp_optimization.hpp.

2. MATHEMATICAL OVERVIEW

2a. LOG LIKELIHOOD METRICS OF MODEL QUALITY

2a, i. BAYESIAN MODEL SELECTION

(Rasmussen & Williams, 5.2) Bayesian model selection uses Bayesian inference to predict various distributional properties of models and their parameters. This analysis is usually performed hierarchically. The pararameters, w, are at the lowest level; e.g., the weights in linear regression. The hyperparameters, \theta, sit at the next level; these are free-floating parameters that control model performance such as length scales. Finally, the highest level is a discrete set of models, H_i; e.g., GPs resulting from different classes of covariance or entirely different models.

Then the “posterior over the parameters” is:

p(w | y, X, \theta, H_i) = \frac{p(y | X, w, H_i) * p(w | \theta, H_i)}{p(y | X, \theta, H_i)}

where p(y | X, w, H_i) is the “likelihood” (of the model) and p(w | \theta, H_i) is the “parameter prior” (encoding what we know about the model parameters before seeing data). The denominator is the “marginal likelihood.” Using total probability, it is given as:

p(y | X, \theta, H_i) = \int p(y | X, w, H_i) * p(w | \theta, H_i) dw,

where we have marginalized out w ~ p(w | \theta, H_i) from the likelihood, p(y | X, w, H_i), to produce the marginal likelihood. This is just the integral of the numerator over w, so it can be viewed as a normalizing constant too–however you like think about Bayes’ Theorem.

From that marginal likelihood, we can also produce the posterior over hyperparameters:

p(\theta | y, X, H_i) = \frac{p(y | X, \theta, H_i * p(\theta, H_i)}{p(y | X, H_i)}

where p(\theta, H_i) is called the “hyper-prior” and the denominator is constructed as before.

And finally, the posterior for the models:

p(H_i | y, X) = \frac{p(y | X, H_i) * p(H_i)}{p(y | X)}

Here p(y | X) is not an integral since H_i is discrete: = \sum_i p(y | X, H_i) * p(H_i)

These integrals can be extremely complicated, often requiring Monte-Carlo (MC) integration. In particular, computing the posterior over hyperparameters is usually particularly painful. This would be the ideal step in the Bayesian framework for selecting hyperparameters; with the posterior distribution we can simply choose the most likely. Using the higher level analysis also requires knowing or forming the priors over hyperparameters and/or models, which can also be tricky when information is lacking.

To make the problem more tractable, people usually end up working with maximization of the marginal likelihood wrt \theta. This process can be tricky: if we parameterize everything in our model (many \theta), it is easy to overfit and produce nonsense where the model reacts strongly to noise.

One nice property of the marginal likelihood is that it automatically trades off between model complexity and data fit. In the next section, we will make this explicit for GP-based models. But it is true in general. First, note that marginal likelihood is a probability distribution so it integrates to 1. A simple model (with few parameters), can only explain a few data sets and the likelihood will be high for these and 0 for the rest. A very complex model can explain many data sets, so it will be nonzero over a wider region but never obtain values as high as the simple model. Marginal likelihood optimization trades off between these, in principle automatically finding the simplest model that still explains the data.

Note: we are usually only working with one model (the GP with a specified class of covariance function), so we drop H_i.

2a, i, A. LOG MARGINAL LIKELIHOOD (LML)

(Rasmussen & Williams, 5.4.1) Now we specialize the Bayesian technique for GPs. We will be working with the log marginal likelihood (LML). GPs are non-parametric in the sense that we are not directly computing parameters \beta_i to evaluate y_i = x_i\beta_i as in linear regression. However, the function values f at the training points (points_sampled) are analogous to parameters; the more sampled points, the more complex the model (more params).

Then the log marginal likelihood, \log(p(y | X, \theta)), examines the probability of the model given the data. It is:

log p(y | X, \theta) = -\frac{1}{2} * y^T * K^-1 * y - \frac{1}{2} * \log(det(K)) - \frac{n}{2} * \log(2*pi)  (Equation 1)

where n is num_sampled, \theta are the hyperparameters, and \log is the natural logarithm.

To maximize p, we can equivalently maximize log(p).

Since we almost never work with noise-free priors, we drop the subscript y from K_y in future discussion; e.g,. in LogMarginalLikelihoodEvaluator::ComputeLogLikelihood().

Anyway, despite the complex integrals and whatnot in the general Bayesian model inference method, the LML for GPs is very easy to derive. From the discussion in gpp_math.hpp/cpp, it should be clear that the GP is distributed like a multi-variate Gaussian:

N(\mu, K) = \frac{1}{\sqrt{(2\pi)^n * det(K)}} * \exp(-\frac{1}{2}*(y-\mu)^T * K^-1 * (y - \mu))

And our GP ~ N(0, K); hence p(y | X, \theta) ~ N(0,K) by definition. Take the logarithm and we reach Equation 1.

Let’s look at the terms:

  • term1 = -\frac{1}{2} * y^T * K^-1 * y
  • term2 = -\frac{1}{2} * \log(det(K))
  • term3 = -\frac{n}{2} * \log(2*pi)

In detail:

  • term1: the only term that depends on the observed function values, y. This is called the “data-fit.” The data fit decreases monotonically as covariance length scales (part of hyperparameters) increase since long lengths force the model to change ‘slowly’, making it less flexible.
  • term2: this term is the complexity penalty, depending only on K. One can think of complexity as a concrete measure of how “bumpy” (short length scales, high frequency) or “not-bumpy” (long length scales, low frequency) the distribution is.* This term increases with length scale; low frequency => low complexity.
  • term3: the simplest term, this is just from normalization (so the hyper-volume under the hyper-surface is 1)

* Here we’re talking about the variance of the distribution, not the mean since term2 only deals with K; e.g., imagine plotting the variance or the 95% confidence region.

Hence optimizing LML is a matter of balancing data fit with model complexity. We made this same observation about LML in the general discussion about Bayesian model selection, arguing about properties of distributions; here we see the explicit terms responsible for the trade-off.

This final point is important so here it is again, verbatim from the hpp: This is not magic. Using GPs as an example, if the covariance function is completely mis-specified, we can blindly go through with marginal likelihood optimization, obtain an “optimal” set of hyperparameters, and proceed... never realizing that our fundamental assumptions are wrong. So care is always needed.

2a, ii. CROSS VALIDATION (CV)

(Rasmussen & Williams, Chp 5.3) Cross validation deals with estimating the generalization error. This is done by splitting the training data, X, into two disjoint sets: X' and V. X' is the reduced training set and V is the validation set. X = X' \cup V and X' \cap V = \emptyset.

Then we train the model on X' and evaluate its performance on V (where we know the answer from direct obsevation, since V \subset X). The errors in prediction on V serve as a proxy for the generalization error.

If |V| is too small, large variance in the estimated error can result (e.g., what if we pick particularly “unlucky” data for V?). But choosing a small X' leads to a model that is too poorly trained to provide useful outputs. Instead, a common technique is to choose multiple disjoint V and run error estimation on each of them.

Taking this idea to the extreme, we choose n sets V with |V| = 1. Hence the name “Leave One Out,” since each member of X takes a turn being the sole validation point.

Finally, what measure do we use to evaluate the performance of the model (trained on X') on V? According to R&W, the most common measure is the squared error loss, but for probabilistic models like GPs, the log probability loss makes more sense.

2a, ii, A. LEAVE ONE OUT CROSS VALIDATION (LOO-CV)

(Rasmussen & Williams, Chp 5.4.2) For a GP, LOO-CV, which we denote L_{LOO} is:

Let \log p(y_i | X_{-i}, y_{-i}, \theta) = -0.5\log(\sigma_i^2) - 0.5*(y_i - \mu_i)^2/\sigma_i^2 - 0.5\log(2\pi). Then we compute:

L_{LOO}(X ,y, \theta) = \sum_{i = 1}^n \log p(y_i | X_{-i}, y_{-i}).

where X_{-i} and y_{-i} are the training data with the i-th point removed. Then X_i is taken as the point to sample. \sigma_i^2 and \mu_i are the GP (predicted) variance/mean at the point to sample, X \ X_{-i}.

On the surface, it looks like we would have to form an entirely new GP n times to compute \mu_i, \sigma_i^2 for i = 1..n. However, in each of these cases, the X_{-i}, f_{-i}, \sigma_n_{-i} inputs to the GP are almost identical, so there is a lot of nearly redundant work going on. If we take K computed with the full training data (X, f, \sigma_n), and remove the i-th row and i-th column, we get the K_{-i} associated with X_{-i}, f_{-i}, \sigma_n_{-i}. And then the fundamental GP operations involve applying K_{-i}^-1: see Equation 2, 3 in gpp_math.cpp.

By partitioning K into 4 block matrices, we can (after row/column swap) isolate the element being removed, and solve for it in terms of K^-1. Wikipedia for block matrix inverse: http://en.wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion We can then directly show that:

\mu_i = y_i - \alpha_i / K^{-1}_{ii}
\sigma_i^2 = 1/K^{-1}_{ii}

where \alpha = K^-1 * y

2a, iii. REMARKS

We have not been able to find much information on whether LML or LOO-CV should be preferred. Rasmussen & Williams say, “[the LML gives] the probability of the observations emph{given the assumptions of the model}.” And “the [frequentist LOO-CV] gives an estimate for the predictive probability, whether or not the assumptions of the model may be fulfilled.” Thus Wahba (1990, sec 4.8) argues that LOO-CV should be more robust to model mis-specification (e.g., wrong class of covariance function).

3. CODE HIERARCHY

There are currently several top-level entry points for model selection (defined in the hpp) including ‘dumb’ search, gradient descent, and Newton:

  • LatinHypercubeSearchHyperparameterOptimization:

    • Estimates the best model fit with a ‘dumb’ search over hyperparameters

    • Selects random guesses based on latin hypercube sampling

    • This calls:

      EvaluateLogLikelihoodAtPointList:

      • Evaluates the selected log likelihood measure at each set of hyperparameters

      • Multithreaded over starting locations

      • This calls:

        MultistartOptimizer<...>::MultistartOptimize(...) for multistarting (see gpp_optimization.hpp) with the NullOptimizer

  • MultistartGradientDescentHyperparameterOptimization:

    • Finds the best model by optimizing hyperparmeters to find maxima of log likelihood metrics

    • Selects random starting locations based on latin hypercube sampling

    • Multithreaded over starting locations

    • Optimizes with restarted gradient descent; collects results and updates the solution as new optima are found

    • This calls:

      MultistartOptimizer<...>::MultistartOptimize(...) for multistarting (see gpp_optimization.hpp) together with GradientDescentOptimizer::Optimize<ObjectiveFunctionEvaluator, Domain>() (see gpp_optimization.hpp)

  • MultistartNewtonHyperparameterOptimization: (Recommended)

    • Finds the best model by optimizing hyperparmeters to find maxima of log likelihood metrics

    • Selects random starting locations based on latin hypercube sampling

    • Multithreaded over starting locations

    • Optimizes with (modified) Newton’s Method; collects results and updates the solution as new optima are found

    • This calls:

      MultistartOptimizer<...>::MultistartOptimize(...) for multistarting (see gpp_optimization.hpp) together with NewtonOptimizer::Optimize<ObjectiveFunctionEvaluator, Domain>() (see gpp_optimization.hpp)

At the moment, we have two choices for the template parameter LogLikelihoodEvaluator: LML and LOO-CV. Each of these make additional lower level calls to gpp_linear_algebra routines and gpp_covariance routines. The details (with derivations and optimizations where appropriate) are specified in the function implementation docs and will not be repeated here.

  • LogMarginalLikelihoodEvaluator:

    At the bottom level, LogMarginalLikelihoodEvaluator contains member functions for computing the LML, its gradient wrt hyperparameters (of covariance), and its hessian wrt hyperparameters. Its data members are the GP model inputs (sampled points, function values at sampled points, noise). It does not know what a covariance is since the covariance (i.e., hyperparameters) is meant to change during optimization.

    Its member functions require a LogMarginalLikelihoodState object which is where the (stateful) covariance is kept, along with derived quantities that are a function of covariance and model inputs, and various temporaries.

    Computations optionally use a faster implementation using explicit matrix inverses; this is poorly conditioned but several times faster.

  • LeaveOneOutLogLikelihoodEvaluator:

    This class contains member fucntions for computing the LOO-CV measure. Its structure is essentially the same as LogMarginalLikelihoodEvaluator. Its members require LeaveOneOutLogLikelihoodState, just as before.

    In the discussion of LOO-CV, we indicated two possible methods to compute the quantities \mu_i, \sigma_i^2. Both are implemented in the code, although it is currently configured to use the faster computation:

    • LeaveOneOutCoreAccurate() computes mu_i, sigma_i^2 the direct way by forming a new GP. This is slow but well-conditioned.
    • LeaveOneOutCoreWithMatrixInverse() computes mu_i, sigma_i^2 using the described “trick”. This is fast but the results may be heavily affected by numerical error (if K is poorly conditioned).

Defines

OL_USE_INVERSE

Note

These comments have been copied into the matching method of LogMarginalLikelihood in python_version/log_likelihood.py.

Computes:

\pderiv{log(p(y | X, \theta))}{\theta_k} = \frac{1}{2} * y_i * \pderiv{K_{ij}}{\theta_k} * y_j -
                                           \frac{1}{2} * trace(K^{-1}_{ij}\pderiv{K_{ij}}{\theta_k})

Or equivalently:

= \frac{1}{2} * trace([\alpha_i \alpha_j - K^{-1}_{ij}]*\pderiv{K_{ij}}{\theta_k}),

where \alpha_i = K^{-1}_{ij} * y_j

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.