gpp_math

Contents:

gpp_math.hpp

  1. OVERVIEW OF GAUSSIAN PROCESSES AND EXPECTED IMPROVEMENT; WHAT ARE WE TRYING TO DO?
  2. FILE OVERVIEW
  3. IMPLEMENTATION NOTES
  4. NOTATION
  5. CITATIONS

1. OVERVIEW OF GAUSSIAN PROCESSES AND EXPECTED IMPROVEMENT; WHAT ARE WE TRYING TO DO?

Note

these comments are copied in Python: interfaces/__init__.py

At a high level, this file optimizes an objective function \(\, f(x)\). This operation requires data/uncertainties about prior and concurrent experiments as well as a covariance function describing how these data [are expected to] relate to each other. The points x represent experiments. If \(\, f(x)\) is, say, the survival rate for a drug test, the dimensions of x might include dosage amount, dosage frequency, and overall drug-use time-span.

The objective function is not required in closed form; instead, only the ability to sample it at points of interest is needed. Thus, the optimization process cannot work with \(\, f(x)\) directly; instead a surrogate is built via interpolation with Gaussian Processes (GPs).

Following Rasmussen & Williams (2.2), a Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Defn 2.1). Hence a GP is fully specified by its mean function, \(\, m(x)\), and covariance function, \(\, k(x,x')\). Then we assume that a real process \(\, f(x)\) (e.g., drug survival rate) is distributed like:

\[f(x) \sim GP(m(x), k(x,x'))\]

with

\[m(x) = E[f(x)], \qquad k(x,x') = E[(f(x) - m(x))(f(x') - m(x'))].\]

Then sampling from \(\, f(x)\) is simply drawing from a Gaussian with the appropriate mean and variance.

However, since we do not know \(\, f(x)\), we cannot precisely build its corresponding GP. Instead, using samples from \(\, f(x)\) (e.g., by measuring experimental outcomes), we can iteratively improve our estimate of \(\, f(x)\). See GaussianProcess class docs and implementation docs for details on how this is done.
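
Concretely, conditioning the GP prior on the sampled data (X, y) with measurement noise \(\, \sigma_n^2\) yields the posterior mean and variance at test points Xs (Rasmussen & Williams Eq. 2.19; these are exactly the quantities computed later by ComputeMeanOfPoints and ComputeVarianceOfPoints):

\[\mu(Xs) = K(Xs,X) \, [K(X,X) + \sigma_n^2 I]^{-1} \, y\]

\[\mathrm{Var}(Xs) = K(Xs,Xs) - K(Xs,X) \, [K(X,X) + \sigma_n^2 I]^{-1} \, K(X,Xs)\]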

The optimization process models the objective using a Gaussian process (GP) prior (also called a GP predictor) based on the specified covariance and the input data (e.g., through member functions ComputeMeanOfPoints, ComputeVarianceOfPoints). Using the GP, we can compute the expected improvement (EI) from sampling any particular point. EI is defined relative to the best currently known value, and it quantifies the improvement over that best value that the algorithm expects to gain from sampling a particular point in parameter space (aka conducting a particular experiment).
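
In symbols (the same definition used by ComputeExpectedImprovement() below), with \(\, f^*\) the best (minimum) value observed so far and \(\, f(Xs_i)\) the GP-predicted values at the points under consideration:

\[EI(Xs) = E_n\left[\left(f^* - \min(f(Xs_1), \ldots, f(Xs_m))\right)^+\right]\]

where \(\, (\cdot)^+\) denotes the positive part, i.e., the expression contributes 0 whenever it is negative.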

See ExpectedImprovementEvaluator and OnePotentialSampleExpectedImprovementEvaluator class docs for further details on computing EI. Both support ComputeExpectedImprovement() and ComputeGradExpectedImprovement().

The dimension of the GP is equal to the number of simultaneous experiments being run; i.e., the GP may be multivariate. The behavior of the GP is controlled by its underlying covariance function and the data/uncertainty of prior points (experiments).

With the ability to compute EI, the final step is to optimize it to find the point(s) with the best EI. This is done using multistart gradient descent (MGD), in ComputeOptimalPointsToSample(). This method wraps an MGD call and falls back on random search if that fails. See gpp_optimization.hpp for multistart/optimization templates. This method can evaluate and optimize EI at several points simultaneously; e.g., if we wanted to run 4 simultaneous experiments, we can use EI to select all 4 points at once.

The literature (e.g., Ginsbourger 2008) refers to these problems collectively as q-EI, where q is a positive integer. So 1-EI is the originally discussed usage, and the previous scenario with multiple simultaneous points/experiments would be called 4-EI.

Additionally, there are use cases where we have existing experiments that are not yet complete but we have an opportunity to start some new trials. For example, maybe we are a drug company currently testing 2 combinations of dosage levels. We got some new funding, and can now afford to test 3 more sets of dosage parameters. Ideally, the decision on the new experiments should depend on the existence of the 2 ongoing tests. We may not have any data from the ongoing experiments yet; e.g., they are [double]-blind trials. If nothing else, we would not want to duplicate any existing experiments! So we want to solve 3-EI using the knowledge of the 2 ongoing experiments.

We call this q,p-EI, so the previous example would be 3,2-EI. The q-EI notation is equivalent to q,0-EI; if we do not explicitly write the value of p, it is 0. So q is the number of new (simultaneous) experiments to select. In code, this would be the size of the output from EI optimization (i.e., best_points_to_sample, of which there are q = num_to_sample points). p is the number of ongoing/incomplete experiments to take into account (i.e., points_being_sampled of which there are p = num_being_sampled points).

Back to optimization: the idea behind gradient descent is simple. The gradient gives us the direction of steepest ascent (negative gradient is steepest descent). So each iteration, we compute the gradient and take a step in that direction. The size of the step is not specified by GD and is left to the specific implementation. Basically if we take steps that are too large, we run the risk of over-shooting the solution and even diverging. If we take steps that are too small, it may take an intractably long time to reach the solution. Thus the magic is in choosing the step size; we do not claim that our implementation is perfect, but it seems to work reasonably. See gpp_optimization.hpp for more details about GD as well as the template definition.
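
As an illustration of the step-size tradeoff only (this is not the library's restarted, domain-constrained gradient descent from gpp_optimization.hpp, and all names below are hypothetical), a minimal fixed-step gradient ascent loop on a toy one-dimensional objective:

#include <cmath>
#include <cstdio>

// Toy gradient ascent: maximize f(x) = -(x - 3)^2, whose gradient is -2*(x - 3).
// A step_size that is too large (e.g., 1.1) overshoots and diverges; one that is
// too small (e.g., 1.0e-6) takes an impractical number of iterations to converge.
int main() {
  const double step_size = 0.1;
  const int max_num_steps = 1000;
  const double tolerance = 1.0e-8;

  double x = 0.0;  // initial guess
  for (int step = 0; step < max_num_steps; ++step) {
    const double gradient = -2.0 * (x - 3.0);
    const double update = step_size * gradient;  // step along the gradient (ascent)
    x += update;
    if (std::fabs(update) < tolerance) {
      break;
    }
  }
  std::printf("approximate maximizer: %.6f (exact: 3)\n", x);
  return 0;
}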

For particularly difficult problems or problems where gradient descent’s parameters are not well-chosen, GD can fail to converge. If this happens, we can fall back on heuristics; e.g., ‘dumb’ search (i.e., evaluate EI at a large number of random points and take the best one). Naive search lives in: ComputeOptimalPointsToSampleViaLatinHypercubeSearch<>().

2. FILE OVERVIEW

This file contains mathematical functions supporting optimal learning. These include functions to compute characteristics of Gaussian Processes (e.g., variance, mean) and the gradients of these quantities as well as functions to compute and optimize the expected improvement.

Functions here generally require some combination of a CovarianceInterface object as well as data about prior and current (i.e., concurrent) experiments. These data are encapsulated in the GaussianProcess class. Then we build an ExpectedImprovementEvaluator object (with associated state, see gpp_common.hpp item 5 for (Evaluator, State) relations) on top of a GaussianProcess for computing and optimizing EI.

For further theoretical details about Gaussian Processes, see Rasmussen and Williams, Gaussian Processes for Machine Learning (2006). A bare-bones summary is provided in gpp_math.cpp.

For further details about expected improvement and the optimization thereof, see Scott Clark’s PhD thesis. Again, a summary is provided in gpp_math.cpp‘s file comments.

3. IMPLEMENTATION NOTES

  1. This file has a few primary endpoints for EI optimization:

    1. ComputeOptimalPointsToSampleWithRandomStarts<>():

      Solves the q,p-EI problem.

      Takes in a gaussian_process describing the prior, domain, config, etc.; outputs the next best point(s) (experiment) to sample (run). Uses gradient descent.

    2. ComputeOptimalPointsToSampleViaLatinHypercubeSearch<>():

      Estimates the q,p-EI problem.

      Takes in a gaussian_process describing the prior, domain, etc.; outputs the next best point(s) (experiment) to sample (run). Uses ‘dumb’ search.

    3. ComputeOptimalPointsToSample<>() (Recommended):

      Solves the q,p-EI problem.

      Wraps the previous two items; relies on gradient descent and falls back to “dumb” search if it fails.

    Note

    See gpp_math.cpp‘s header comments for more detailed implementation notes.

    There are also several other functions with external linkage in this header; these are provided primarily to ease testing and to permit lower level access from python.

  2. See gpp_common.hpp header comments for additional implementation notes.

4. NOTATION

Domain-specific notation, following Rasmussen & Williams:

  • X = points_sampled; this is the training data (size dim X num_sampled), also called the design matrix
  • Xs = points_to_sample; this is the test data (size dim X num_to_sample)
  • y, f, f(x) = points_sampled_value, the experimental results from sampling training points
  • K, K_{ij}, K(X,X) = covariance(X_i, X_j), covariance matrix between training inputs (num_sampled x num_sampled)
  • Ks, Ks_{ij}, K(X,Xs) = covariance(X_i, Xs_j), covariance matrix between training and test inputs (num_sampled x num_to_sample)
  • Kss, Kss_{ij}, K(Xs,Xs) = covariance(Xs_i, Xs_j), covariance matrix between test inputs (num_to_sample x num_to_sample)
  • \theta: (vector) of hyperparameters for a covariance function

Note

Due to confusion with multiplication (K_* looks awkward in code comments), Rasmussen & Williams’ \(\, K_*\) notation has been replaced with Ks and \(\, K_{**}\) is Kss.

Connecting to the q,p-EI notation, the points counted by both “q” and “p” are contained in Xs. Within the GP, there is no distinction between points being sampled by ongoing experiments and new points to sample.

5. CITATIONS

  1. Gaussian Processes for Machine Learning. Carl Edward Rasmussen and Christopher K. I. Williams. 2006. Massachusetts Institute of Technology. 55 Hayward St., Cambridge, MA 02142. http://www.gaussianprocess.org/gpml/ (free electronic copy)
  2. Parallel Machine Learning Algorithms In Bioinformatics and Global Optimization (PhD Dissertation). Part II, EPI: Expected Parallel Improvement. Scott Clark. 2012. Cornell University, Center for Applied Mathematics. Ithaca, NY. https://github.com/sc932/Thesis sclark@yelp.com
  3. Differentiation of the Cholesky Algorithm. S. P. Smith. 1995. Journal of Computational and Graphical Statistics. Volume 4. Number 2. p134-147
  4. A Multi-points Criterion for Deterministic Parallel Global Optimization based on Gaussian Processes. David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. 2008. Département 3MI. École Nationale Supérieure des Mines. 158 cours Fauriel, Saint-Étienne, France. ginsbourger@emse.fr, leriche@emse.fr, carraro@emse.fr
  5. Efficient Global Optimization of Expensive Black-Box Functions. Jones, D.R., Schonlau, M., Welch, W.J. 1998. Journal of Global Optimization, 13, 455-492.

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.

Functions

OL_NONNULL_POINTERS void SetupExpectedImprovementState(const OnePotentialSampleExpectedImprovementEvaluator & ei_evaluator, double const *restrict starting_point, int max_num_threads, bool configure_for_gradients, std::vector< typename OnePotentialSampleExpectedImprovementEvaluator::StateType > * state_vector)

Set up vector of OnePotentialSampleExpectedImprovementEvaluator::StateType.

This is a utility function just for reducing code duplication.

dim is the spatial dimension, ei_evaluator.dim()

Parameters:
ei_evaluator:evaluator object associated w/the state objects being constructed
starting_point[dim]:
 initial point to load into state (must be a valid point for the problem)
max_num_threads:
 maximum number of threads for use by OpenMP (generally should be <= # cores)
configure_for_gradients:
 true if these state objects will be used to compute gradients, false otherwise
state_vector[arbitrary]:
 vector of state objects, arbitrary size (usually 0)
Outputs:
state_vector[max_num_threads]:
 vector of states containing max_num_threads properly initialized state objects

OL_NONNULL_POINTERS void SetupExpectedImprovementState(const ExpectedImprovementEvaluator & ei_evaluator, double const *restrict points_to_sample, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, int max_num_threads, bool configure_for_gradients, NormalRNG * normal_rng, std::vector< typename ExpectedImprovementEvaluator::StateType > * state_vector)

Set up vector of ExpectedImprovementEvaluator::StateType.

This is a utility function just for reducing code duplication.

Parameters:
ei_evaluator:evaluator object associated w/the state objects being constructed
points_to_sample[dim][num_to_sample]:
 initial points to load into state (must be a valid point for the problem); i.e., points at which to evaluate EI and/or its gradient
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the p in q,p-EI)
max_num_threads:
 maximum number of threads for use by OpenMP (generally should be <= # cores)
configure_for_gradients:
 true if these state objects will be used to compute gradients, false otherwise
state_vector[arbitrary]:
 vector of state objects, arbitrary size (usually 0)
normal_rng[max_num_threads]:
 a vector of NormalRNG objects that provide the (pseudo)random source for MC integration
Outputs:
state_vector[max_num_threads]:
 vector of states containing max_num_threads properly initialized state objects

template < typename ExpectedImprovementEvaluator, typename DomainType >
void RestartedGradientDescentEIOptimization(const ExpectedImprovementEvaluator & ei_evaluator, const GradientDescentParameters & optimizer_parameters, const DomainType & domain, double const *restrict initial_guess, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, NormalRNG * normal_rng, double *restrict next_point)

Solve the q,p-EI problem (see ComputeOptimalPointsToSample and/or header docs) by optimizing the Expected Improvement. Optimization is done using restarted Gradient Descent, via GradientDescentOptimizer<...>::Optimize() from gpp_optimization.hpp. Please see that file for details on gradient descent and see gpp_optimizer_parameters.hpp for the meanings of the GradientDescentParameters.

This function is just a simple wrapper that sets up the Evaluator’s State and calls a general template for restarted GD.

This function does not perform multistarting or employ any other robustness-boosting heuristics; it only converges if the initial_guess is close to the solution. In general, ComputeOptimalPointsToSample() (see below) is preferred. This function is meant for:

  1. easier testing;
  2. if you really know what you’re doing.

Solution is guaranteed to lie within the region specified by domain; note that this may not be a true optimum (i.e., the gradient may be substantially nonzero).

Parameters:
ei_evaluator:reference to object that can compute ExpectedImprovement and its spatial gradient
optimizer_parameters:
 GradientDescentParameters object that describes the parameters controlling EI optimization (e.g., number of iterations, tolerances, learning rate)
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
initial_guess[dim][num_to_sample]:
 initial guess for gradient descent
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the “p” in q,p-EI)
normal_rng[1]:a NormalRNG object that provides the (pseudo)random source for MC integration
Outputs:
normal_rng[1]:NormalRNG object will have its state changed due to random draws
next_point[dim][num_to_sample]:
 points yielding the best EI according to gradient descent

template < typename DomainType >
OL_NONNULL_POINTERS void ComputeOptimalPointsToSampleViaMultistartGradientDescent(const GaussianProcess & gaussian_process, const GradientDescentParameters & optimizer_parameters, const DomainType & domain, const ThreadSchedule & thread_schedule, double const *restrict start_point_set, double const *restrict points_being_sampled, int num_multistarts, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, NormalRNG * normal_rng, bool *restrict found_flag, double *restrict best_next_point)

Perform multistart gradient descent (MGD) to solve the q,p-EI problem (see ComputeOptimalPointsToSample and/or header docs). Starts a GD run from each point in start_point_set. The point corresponding to the optimal EI* is stored in best_next_point.

* Multistarting is a heuristic for global optimization. EI is not convex, so this method may not find the true optimum.

This function wraps MultistartOptimizer<>::MultistartOptimize() (see gpp_optimization.hpp), which provides the multistarting component. Optimization is done using restarted Gradient Descent, via GradientDescentOptimizer<...>::Optimize() from gpp_optimization.hpp. Please see that file for details on gradient descent and see gpp_optimizer_parameters.hpp for the meanings of the GradientDescentParameters.

This function (or its wrappers, e.g., ComputeOptimalPointsToSampleWithRandomStarts) are the primary entry-points for gradient descent based EI optimization in the optimal_learning library.

Users may prefer to call ComputeOptimalPointsToSample(), which applies other heuristics to improve robustness.

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline for sizing the domain and num_multistarts; i.e., be wary of sets of initial guesses that cover the space too sparsely.

Solution is guaranteed to lie within the region specified by domain; note that this may not be a true optimum (i.e., the gradient may be substantially nonzero).

Warning

This function fails ungracefully if NO improvement can be found! In that case, best_next_point will always be the first point in start_point_set. found_flag will indicate whether this occurred.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
optimizer_parameters:
 GradientDescentParameters object that describes the parameters controlling EI optimization (e.g., number of iterations, tolerances, learning rate)
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_dynamic), chunk_size (0).
start_point_set[dim][num_to_sample][num_multistarts]:
 set of initial guesses for MGD (one block of num_to_sample points per multistart)
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_multistarts:
 number of points in set of initial guesses
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the “p” in q,p-EI)
best_so_far:value of the best sample so far (must be min(points_sampled_value))
max_int_steps:maximum number of MC iterations
normal_rng[thread_schedule.max_num_threads]:
 a vector of NormalRNG objects that provide the (pseudo)random source for MC integration
Outputs:
normal_rng[thread_schedule.max_num_threads]:
 NormalRNG objects will have their state changed due to random draws
found_flag[1]:true if best_next_point corresponds to a nonzero EI
best_next_point[dim][num_to_sample]:
 points yielding the best EI according to MGD

template < typename DomainType >
void ComputeOptimalPointsToSampleWithRandomStarts(const GaussianProcess & gaussian_process, const GradientDescentParameters & optimizer_parameters, const DomainType & domain, const ThreadSchedule & thread_schedule, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, NormalRNG * normal_rng, double *restrict best_next_point)

Perform multistart gradient descent (MGD) to solve the q,p-EI problem (see ComputeOptimalPointsToSample and/or header docs), starting from num_multistarts points selected randomly from within the domain.

This function is a simple wrapper around ComputeOptimalPointsToSampleViaMultistartGradientDescent(). It additionally generates a set of random starting points and is just here for convenience when better initial guesses are not available.

See ComputeOptimalPointsToSampleViaMultistartGradientDescent() for more details.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
optimizer_parameters:
 GradientDescentParameters object that describes the parameters controlling EI optimization (e.g., number of iterations, tolerances, learning rate)
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_dynamic), chunk_size (0).
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the “p” in q,p-EI)
best_so_far:value of the best sample so far (must be min(points_sampled_value))
max_int_steps:maximum number of MC iterations
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
normal_rng[thread_schedule.max_num_threads]:
 a vector of NormalRNG objects that provide the (pseudo)random source for MC integration
Outputs:
found_flag[1]:true if best_next_point corresponds to a nonzero EI
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
normal_rng[thread_schedule.max_num_threads]:
 NormalRNG objects will have their state changed due to random draws
best_next_point[dim][num_to_sample]:
 points yielding the best EI according to MGD

template < typename DomainType >
void ComputeOptimalPointsToSampleViaLatinHypercubeSearch(const GaussianProcess & gaussian_process, const DomainType & domain, const ThreadSchedule & thread_schedule, double const *restrict points_being_sampled, int num_multistarts, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, NormalRNG * normal_rng, double *restrict best_next_point)

Perform a random, naive search to “solve” the q,p-EI problem (see ComputeOptimalPointsToSample and/or header docs). Evaluates EI at num_multistarts points (e.g., on a latin hypercube) to find the point with the best EI value.

Generally gradient descent is preferred, but when it fails to converge this may be the only “robust” option.

Solution is guaranteed to lie within the region specified by domain; note that this may not be a true optimum (i.e., the gradient may be substantially nonzero).

Wraps EvaluateEIAtPointList(); constructs the input point list with a uniform random sampling from the given Domain object.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_static), chunk_size (0).
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_multistarts:
 number of random points to check
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the “p” in q,p-EI)
best_so_far:value of the best sample so far (must be min(points_sampled_value))
max_int_steps:maximum number of MC iterations
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
normal_rng[thread_schedule.max_num_threads]:
 a vector of NormalRNG objects that provide the (pseudo)random source for MC integration
Outputs:
found_flag[1]:true if best_next_point corresponds to a nonzero EI
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
normal_rng[thread_schedule.max_num_threads]:
 NormalRNG objects will have their state changed due to random draws
best_next_point[dim][num_to_sample]:
 points yielding the best EI according to dumb search

class ExpectedImprovementEvaluator

A class to encapsulate the computation of expected improvement and its spatial gradient. This class handles the general EI computation case using monte carlo integration; it can support q,p-EI optimization. It is designed to work with any GaussianProcess. Additionally, this class has no state and within the context of EI optimization, it is meant to be accessed by const reference only.

The random numbers needed for EI computation will be passed as parameters instead of contained as members to make multithreading more straightforward.

Public Type

typedef ExpectedImprovementState StateType

Public Functions

ExpectedImprovementEvaluator(const GaussianProcess & gaussian_process_in, int num_mc_iterations, double best_so_far)

Constructs an ExpectedImprovementEvaluator object. All inputs are required; no default constructor nor copy/assignment are allowed.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
num_mc_iterations:
 number of monte carlo iterations
best_so_far:best (minimum) objective function value (in points_sampled_value)

int dim()

const GaussianProcess * gaussian_process()

double ComputeObjectiveFunction(StateType * ei_state)

Wrapper for ComputeExpectedImprovement(); see that function for details.

void ComputeGradObjectiveFunction(StateType * ei_state, double *restrict grad_EI)

Wrapper for ComputeGradExpectedImprovement(); see that function for details.

double ComputeExpectedImprovement(StateType * ei_state)

Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+], where Xs are potential points to sample (union of points_to_sample and points_being_sampled) and X are already sampled points. The ^+ indicates that the expression in the expectation evaluates to 0 if it is negative. f^*(X) is the MINIMUM over all known function evaluations (points_sampled_value), whereas f(Xs) are GP-predicted function evaluations.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

In words, we are computing the expected improvement (over the current best_so_far, best known objective function value) that would result from sampling (aka running new experiments) at points_to_sample with points_being_sampled concurrent/ongoing experiments.

In general, the EI expression is complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The idea of the MC approach is to repeatedly sample at the union of points_to_sample and points_being_sampled. This is analogous to gaussian_process_interface.sample_point_from_gp, but we sample num_union points at once:

y = \mu + Lw

where \mu is the GP-mean, L is the chol_factor(GP-variance) and w is a vector of num_union draws from N(0, 1). Then:

improvement_per_step = max(max(best_so_far - y), 0.0)

Observe that the inner max means only the smallest component of y contributes in each iteration. We compute the improvement over many random draws and average.

Note

These comments were copied into ExpectedImprovementInterface.compute_expected_improvement() in interfaces/expected_improvement_interface.py.

Parameters:
ei_state[1]:properly configured state object
Outputs:
ei_state[1]:state with temporary storage modified; normal_rng modified
Returns:
the expected improvement from sampling points_to_sample with points_being_sampled concurrent experiments

Let Ls * Ls^T = Vars and w = vector of IID normal(0,1) variables. Then:

y = mus + Ls * w (Equation 4, from file docs)

simulates drawing from our GP with mean mus and variance Vars.

Then, as given in the file docs, the improvement for this single sample is:

I = { best_known - min(y)   if (best_known - min(y) > 0)      (Equation 5 from file docs)
    {          0               else

This is implemented as max_{y} (best_known - y). Notice that improvement takes the value 0 if it would be negative.

Since we cannot compute min(y) directly, we do so via monte-carlo (MC) integration. That is, we draw from the GP repeatedly, computing improvement during each iteration, and averaging the result.

See Scott’s PhD thesis, sec 6.2.
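
The Monte Carlo estimator just described can be sketched independently of the library's types. The snippet below assumes the GP mean mu and the lower-triangular Cholesky factor of the GP variance at the union of points have already been computed (e.g., via ComputeMeanOfPoints and the Cholesky factorization of ComputeVarianceOfPoints); the function name and storage layout are illustrative, not the library's API:

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Monte Carlo estimate of EI: repeatedly draw y = mu + L*w with w ~ N(0, I),
// record max(best_so_far - min(y), 0) for each draw, and average over draws.
// chol_L is the row-major, lower-triangular Cholesky factor of the GP variance
// at the num_union points (points_to_sample followed by points_being_sampled).
double EstimateExpectedImprovement(const std::vector<double>& mu,
                                   const std::vector<double>& chol_L,
                                   double best_so_far, int num_mc_iterations,
                                   std::mt19937_64* generator) {
  const int num_union = static_cast<int>(mu.size());
  std::normal_distribution<double> normal(0.0, 1.0);
  std::vector<double> w(num_union), y(num_union);

  double aggregate_improvement = 0.0;
  for (int mc = 0; mc < num_mc_iterations; ++mc) {
    for (int i = 0; i < num_union; ++i) {
      w[i] = normal(*generator);
    }
    // y = mu + L * w; L is lower triangular, so the inner sum only runs over j <= i.
    for (int i = 0; i < num_union; ++i) {
      double accum = mu[i];
      for (int j = 0; j <= i; ++j) {
        accum += chol_L[i * num_union + j] * w[j];
      }
      y[i] = accum;
    }
    const double improvement = best_so_far - *std::min_element(y.begin(), y.end());
    aggregate_improvement += std::max(improvement, 0.0);
  }
  return aggregate_improvement / static_cast<double>(num_mc_iterations);
}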

Note

comments here are copied to _compute_expected_improvement_monte_carlo() in python_version/expected_improvement.py

void ComputeGradExpectedImprovement(StateType * ei_state, double *restrict grad_EI)

Computes the (partial) derivatives of the expected improvement with respect to each point of points_to_sample. As with ComputeExpectedImprovement(), this computation accounts for the effect of points_being_sampled concurrent experiments.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

In general, the expressions for gradients of EI are complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate them. When faster (e.g., analytic) techniques are available, we will prefer them.

The MC computation of grad EI is similar to the computation of EI (described in compute_expected_improvement). We differentiate y = \mu + Lw wrt points_to_sample; only terms from the gradient of \mu and L contribute. In EI, we computed:

improvement_per_step = max(max(best_so_far - y), 0.0)

and noted that only the smallest component of y may contribute (if it is > 0.0). Call this index winner. Thus in computing grad EI, we only add gradient terms that are attributable to the winner-th component of y.
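
A compressed sketch of that bookkeeping for one MC iteration (hypothetical names and layouts; only the \mu-gradient term is shown explicitly, and the grad_chol_decomp term mentioned above would be accumulated in the same place, contracted against the current draw w):

#include <algorithm>
#include <vector>

// After forming y = mu + L*w, find the index of the smallest component ("winner").
// The iteration only contributes if best_so_far - y[winner] > 0, and then only
// gradient terms attributable to y[winner] are added. grad_mu is stored with the
// dim index changing fastest (grad_mu[d][i] laid out as i*dim + d), matching the
// [dim][num_to_sample] convention used in this header.
void AccumulateGradEIOneIteration(const std::vector<double>& y, double best_so_far,
                                  const std::vector<double>& grad_mu,
                                  int dim, int num_to_sample,
                                  std::vector<double>* aggregate_grad_ei) {
  const int winner = static_cast<int>(std::min_element(y.begin(), y.end()) - y.begin());
  if (best_so_far - y[winner] <= 0.0) {
    return;  // no improvement this iteration, so no gradient contribution
  }
  if (winner < num_to_sample) {
    // d(best_so_far - y[winner]) / dXq_{d,winner} picks up -grad_mu for the winner only.
    for (int d = 0; d < dim; ++d) {
      (*aggregate_grad_ei)[winner * dim + d] -= grad_mu[winner * dim + d];
    }
  }
  // The grad_chol_decomp term (derivative of L*w wrt each point of points_to_sample,
  // restricted to the winner-th row of L) is accumulated here as well; omitted for brevity.
}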

Note

These comments were copied into ExpectedImprovementInterface.compute_expected_improvement() in interfaces/expected_improvement_interface.py.

Parameters:
ei_state[1]:properly configured state object
Outputs:
ei_state[1]:state with temporary storage modified; normal_rng modified
grad_EI[dim][num_to_sample]:
 gradient of EI, \pderiv{EI(Xq \cup Xp)}{Xq_{d,i}} where Xq is points_to_sample and Xp is points_being_sampled (grad EI from sampling points_to_sample with points_being_sampled concurrent experiments wrt each dimension of the points in points_to_sample)

Computes gradient of EI (see ExpectedImprovementEvaluator::ComputeGradExpectedImprovement) wrt points_to_sample (stored in union_of_points[0:num_to_sample]).

Mechanism is similar to the computation of EI, where points’ contributions to the gradient are thrown out if their corresponding improvement is <= 0.0.

Thus \nabla(\mu) only contributes when the winner (point w/best improvement this iteration) is the current point. That is, the gradient of \mu at x_i wrt x_j is 0 unless i == j (and only this result is stored in ei_state->grad_mu). The interaction with ei_state->grad_chol_decomp is harder to know a priori (unlike with grad_mu) and has a more complex structure (rank 3 tensor), so the derivative wrt x_j is computed fully, and the relevant submatrix (indexed by the current winner) is accessed each iteration.

Note

comments here are copied to _compute_grad_expected_improvement_monte_carlo() in python_version/expected_improvement.py

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(ExpectedImprovementEvaluator)

Private Members

const int dim_

spatial dimension (e.g., entries per point of points_sampled)

int num_mc_iterations_

number of monte carlo iterations

double best_so_far_

best (minimum) objective function value (in points_sampled_value)

const GaussianProcess * gaussian_process_

pointer to gaussian process used in EI computations

class ExpectedImprovementState

State object for ExpectedImprovementEvaluator. This tracks the points being sampled in concurrent experiments (points_being_sampled) ALONG with the points currently being evaluated via expected improvement for future experiments (called points_to_sample); these are the p and q of q,p-EI, respectively. points_to_sample joined with points_being_sampled is stored in union_of_points in that order.

This struct also tracks the state of the GaussianProcess that underlies the expected improvement computation: the GP state is built to handle the initial union_of_points, and subsequent updates to points_to_sample in this object also update the GP state.

This struct also holds a pointer to a random number generator needed for Monte Carlo integrated EI computations.

Warning

Users MUST guarantee that multiple state objects DO NOT point to the same RNG (in a multithreaded env).

See general comments on State structs in gpp_common.hpp‘s header docs.

Public Type

typedef ExpectedImprovementEvaluator EvaluatorType

Public Functions

ExpectedImprovementState(const EvaluatorType & ei_evaluator, double const *restrict points_to_sample, double const *restrict points_being_sampled, int num_to_sample_in, int num_being_sampled_in, bool configure_for_gradients, NormalRNGInterface * normal_rng_in)

Constructs an ExpectedImprovementState object with a specified source of randomness for the purpose of computing EI (and its gradient) over the specified set of points to sample. This establishes properly sized/initialized temporaries for EI computation, including dependent state from the associated Gaussian Process (which arrives as part of the ei_evaluator).

Warning

This object is invalidated if the associated ei_evaluator is mutated. SetupState() should be called to reset.

Warning

Using this object to compute gradients when configure_for_gradients := false results in UNDEFINED BEHAVIOR.

Parameters:
ei_evaluator:expected improvement evaluator object that specifies the parameters & GP for EI evaluation
points_to_sample[dim][num_to_sample]:
 points at which to evaluate EI and/or its gradient to check their value in future experiments (i.e., test points for GP predictions)
points_being_sampled[dim][num_being_sampled]:
 points being sampled in concurrent experiments
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled in concurrent experiments (i.e., the “p” in q,p-EI)
configure_for_gradients:
 true if this object will be used to compute gradients, false otherwise
normal_rng[1]:pointer to a properly initialized* NormalRNG object

Note

* The NormalRNG object must already be seeded. If multithreaded computation is used for EI, then every state object must have a different NormalRNG (different seeds, not just different objects).

ExpectedImprovementState(ExpectedImprovementState && other)

int GetProblemSize()

void GetCurrentPoint(double *restrict points_to_sample)

Get the points_to_sample: potential future samples whose EI (and/or gradients) are being evaluated

Outputs:
points_to_sample[dim][num_to_sample]:
 potential future samples whose EI (and/or gradients) are being evaluated

void SetCurrentPoint(const EvaluatorType & ei_evaluator, double const *restrict points_to_sample)

Change the potential samples whose EI (and/or gradient) are being evaluated. Update the state’s derived quantities to be consistent with the new points.

Parameters:
ei_evaluator:expected improvement evaluator object that specifies the parameters & GP for EI evaluation
points_to_sample[dim][num_to_sample]:
 potential future samples whose EI (and/or gradients) are being evaluated

void SetupState(const EvaluatorType & ei_evaluator, double const *restrict points_to_sample)

Configures this state object with new points_to_sample, the location of the potential samples whose EI is to be evaluated. Ensures all state variables & temporaries are properly sized. Properly sets all dependent state variables (e.g., GaussianProcess’s state) for EI evaluation.

Warning

This object’s state is INVALIDATED if the ei_evaluator (including the GaussianProcess it depends on) used in SetupState is mutated! SetupState() should be called again in such a situation.

Parameters:
ei_evaluator:expected improvement evaluator object that specifies the parameters & GP for EI evaluation
points_to_sample[dim][num_to_sample]:
 potential future samples whose EI (and/or gradients) are being evaluated

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(ExpectedImprovementState)

Public Members

const int dim

spatial dimension (e.g., entries per point of points_sampled)

const int num_to_sample

number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)

const int num_being_sampled

number of points being sampled concurrently (i.e., the “p” in q,p-EI)

const int num_derivatives

number of derivative terms desired (usually 0 for no derivatives or num_to_sample)

const int num_union

number of points in union_of_points: num_to_sample + num_being_sampled

std::vector< double > union_of_points

points currently being sampled; this is the union of the points represented by “q” and “p” in q,p-EI. points_to_sample is stored first in memory, immediately followed by points_being_sampled

GaussianProcess::StateType points_to_sample_state

gaussian process state

NormalRNGInterface * normal_rng

random number generator

std::vector< double > to_sample_mean

the mean of the GP evaluated at union_of_points

std::vector< double > grad_mu

the gradient of the GP mean evaluated at union_of_points, wrt union_of_points[0:num_to_sample]

std::vector< double > cholesky_to_sample_var

the cholesky (LL^T) factorization of the GP variance evaluated at union_of_points

std::vector< double > grad_chol_decomp

the gradient of the cholesky (LL^T) factorization of the GP variance evaluated at union_of_points wrt union_of_points[0:num_to_sample]

std::vector< double > EI_this_step_from_var

improvement (per mc iteration) evaluated at each of union_of_points

std::vector< double > aggregate

tracks the aggregate grad EI from all mc iterations

std::vector< double > normals

normal rng draws

Public Static Functions

std::vector< double > BuildUnionOfPoints(double const *restrict points_to_sample, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, int dim)

Create a vector with the union of points_to_sample and points_being_sampled (the latter is appended to the former).

Note the return by value. Assigning the return to a std::vector<double> or passing it as an argument to the ctor will result in copy elision or move semantics; no copying/performance loss.

Parameters:
points_to_sample[dim][num_to_sample]:
 points at which to evaluate EI and/or its gradient to check their value in future experiments (i.e., test points for GP predictions)
points_being_sampled[dim][num_being_sampled]:
 points being sampled in concurrent experiments
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled in concurrent experiments (i.e., the “p” in q,p-EI)
dim:the number of spatial dimensions of each point array
Returns:
std::vector<double> with the union of the input arrays: points_being_sampled is appended to points_to_sample

class GaussianProcess

Object that encapsulates Gaussian Process Priors (GPPs). A GPP is defined by a set of (sample point, function value, noise variance) triples along with a covariance function that relates the points. Each point has dimension dim. These are the training data; for example, each sample point might specify an experimental cohort and the corresponding function value is the objective measured for that experiment. There is one noise variance value per function value; this is the measurement error and is treated as N(0, noise_variance) Gaussian noise.

GPPs estimate a real process \(\, f(x) = GP(m(x), k(x,x'))\) (see file docs). This class deals with building an estimator to the actual process using measurements taken from the actual process–the (sample point, function val, noise) triple. Then predictions about unknown points can be made by sampling from the GPP–in particular, finding the (predicted) mean and variance. These functions (and their gradients) are provided in ComputeMeanOfPoints, ComputeVarianceOfPoints, etc.

Further mathematical details are given in the implementation comments, but we are essentially computing:

ComputeMeanOfPoints : K(Xs, X) * [K(X,X) + \sigma_n^2 I]^{-1} * y
ComputeVarianceOfPoints: K(Xs, Xs) - K(Xs,X) * [K(X,X) + \sigma_n^2 I]^{-1} * K(X,Xs)

This (estimated) mean and variance characterize the predicted distributions of the actual \(\, m(x), k(x,x')\) functions that underlie our GP.

Note

the preceding comments are copied in Python: interfaces/gaussian_process_interface.py
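
To make the two formulas above concrete, here is a small self-contained sketch that evaluates them with a Cholesky factorization rather than an explicit matrix inverse (the library does the same; see gpp_linear_algebra). The function and helper names are illustrative only, not the GaussianProcess API:

#include <cmath>
#include <vector>

// Illustrative sketch: compute the GP posterior mean and variance at num_to_sample
// test points given num_sampled training points,
//   mean = Ks^T * (K + sigma_n^2 I)^{-1} * y
//   Var  = Kss - Ks^T * (K + sigma_n^2 I)^{-1} * Ks
// Matrices are row-major: K is num_sampled x num_sampled, Ks is
// num_sampled x num_to_sample, Kss is num_to_sample x num_to_sample.

// In-place Cholesky: A = L * L^T; the lower triangle of A is overwritten with L.
void CholeskyFactor(int n, std::vector<double>* A) {
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < j; ++k) (*A)[j*n + j] -= (*A)[j*n + k] * (*A)[j*n + k];
    (*A)[j*n + j] = std::sqrt((*A)[j*n + j]);
    for (int i = j + 1; i < n; ++i) {
      for (int k = 0; k < j; ++k) (*A)[i*n + j] -= (*A)[i*n + k] * (*A)[j*n + k];
      (*A)[i*n + j] /= (*A)[j*n + j];
    }
  }
}

// Solve L * x = b (forward substitution), overwriting b with x.
void SolveLowerTriangular(int n, const std::vector<double>& L, std::vector<double>* b) {
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < i; ++j) (*b)[i] -= L[i*n + j] * (*b)[j];
    (*b)[i] /= L[i*n + i];
  }
}

// Solve L^T * x = b (back substitution), overwriting b with x.
void SolveUpperTriangular(int n, const std::vector<double>& L, std::vector<double>* b) {
  for (int i = n - 1; i >= 0; --i) {
    for (int j = i + 1; j < n; ++j) (*b)[i] -= L[j*n + i] * (*b)[j];
    (*b)[i] /= L[i*n + i];
  }
}

void GpPosterior(int num_sampled, int num_to_sample,
                 std::vector<double> K,  // copied; overwritten with its Cholesky factor
                 const std::vector<double>& Ks, const std::vector<double>& Kss,
                 const std::vector<double>& y, const std::vector<double>& noise_variance,
                 std::vector<double>* mean,        // [num_to_sample]
                 std::vector<double>* variance) {  // [num_to_sample][num_to_sample]
  for (int i = 0; i < num_sampled; ++i) K[i*num_sampled + i] += noise_variance[i];
  CholeskyFactor(num_sampled, &K);

  // alpha = (K + sigma_n^2 I)^{-1} * y via two triangular solves; mean = Ks^T * alpha.
  std::vector<double> alpha(y);
  SolveLowerTriangular(num_sampled, K, &alpha);
  SolveUpperTriangular(num_sampled, K, &alpha);
  mean->assign(num_to_sample, 0.0);
  for (int i = 0; i < num_to_sample; ++i)
    for (int k = 0; k < num_sampled; ++k)
      (*mean)[i] += Ks[k*num_to_sample + i] * alpha[k];

  // V = L^{-1} * Ks (column by column); Var = Kss - V^T * V.
  std::vector<double> V(Ks);
  for (int j = 0; j < num_to_sample; ++j) {
    std::vector<double> col(num_sampled);
    for (int k = 0; k < num_sampled; ++k) col[k] = Ks[k*num_to_sample + j];
    SolveLowerTriangular(num_sampled, K, &col);
    for (int k = 0; k < num_sampled; ++k) V[k*num_to_sample + j] = col[k];
  }
  variance->assign(num_to_sample * num_to_sample, 0.0);
  for (int i = 0; i < num_to_sample; ++i)
    for (int j = 0; j < num_to_sample; ++j) {
      double dot = 0.0;
      for (int k = 0; k < num_sampled; ++k) dot += V[k*num_to_sample + i] * V[k*num_to_sample + j];
      (*variance)[i*num_to_sample + j] = Kss[i*num_to_sample + j] - dot;
    }
}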

For testing and experimental purposes, this class provides a framework for sampling points from the GP (i.e., given a point to sample and predicted measurement noise) as well as adding additional points to an already-formed GP. Sampling points requires drawing from \(\, N(0,1)\) so this class also holds PRNG state to do so via the NormalRNG object from gpp_random.

Note

Functions that manipulate the PRNG directly or indirectly (changing state, generating points) are NOT THREAD-SAFE. All thread-safe functions are marked const.

These mean/variance methods require some external state: namely, the set of potential points to sample. Additionally, temporaries and derived quantities depending on these “points to sample” eliminate redundant computation. This external state is handled through PointsToSampleState objects, which are constructed separately and filled through PointsToSampleState::SetupState() which interacts with functions in this class.

Public Type

typedef PointsToSampleState StateType

typedef NormalRNG NormalGeneratorType

typedef NormalGeneratorType::EngineType EngineType

Public Functions

GaussianProcess(const CovarianceInterface & covariance_in, double const *restrict points_sampled_in, double const *restrict points_sampled_value_in, double const *restrict noise_variance_in, int dim_in, int num_sampled_in)

Constructs a GaussianProcess object. All inputs are required; no default constructor nor copy/assignment are allowed.

Warning

points_sampled is not allowed to contain duplicate points; doing so results in singular covariance matrices.

Parameters:
covariance:the CovarianceFunction object encoding assumptions about the GP’s behavior on our data
points_sampled[dim][num_sampled]:
 points that have already been sampled
points_sampled_value[num_sampled]:
 values of the already-sampled points
noise_variance[num_sampled]:
 the \sigma_n^2 (noise variance) associated w/observation, points_sampled_value
dim:the spatial dimension of a point (i.e., number of independent params in experiment)
num_sampled:number of already-sampled points

int dim()

int num_sampled()

const std::vector< double > & points_sampled()

const std::vector< double > & points_sampled_value()

const std::vector< double > & noise_variance()

void SetCovarianceHyperparameters(double const *restrict hyperparameters_new)

Change the hyperparameters of this GP’s covariance function. Also forces recomputation of all derived quantities for GP to remain consistent.

Warning

Using this function invalidates any PointsToSampleState objects created with “this” object. For any such objects “state”, call state.SetupState(...) to restore them.

Parameters:
hyperparameters_new[covariance_ptr->GetNumberOfHyperparameters]:
 new hyperparameter array

void FillPointsToSampleState(StateType * points_to_sample_state)

Sets up the PointsToSampleState object so that it can be used to compute GP mean, variance, and gradients thereof. ASSUMES all needed space is ALREADY ALLOCATED.

This function should not be called directly; instead use PointsToSampleState::SetupState().

Parameters:
points_to_sample_state[1]:
 pointer to a PointsToSampleState object where all space has been properly allocated
Outputs:
points_to_sample_state[1]:
 pointer to a fully configured PointsToSampleState object. overwrites input

Sets up precomputed quantities needed for mean, variance, and gradients thereof. These quantities are:

Ks := Ks_{k,i} = cov(X_k, Xs_i) (used by mean, variance)

Then if we need gradients:

K^-1 * Ks := solution X of K_{k,l} * X_{l,i} = Ks_{k,i} (used by variance, grad variance)
gradient of Ks := C_{d,k,i} = \pderiv{Ks_{k,i}}{Xs_{d,i}} (used by grad mean, grad variance)

void AddPointsToGP(double const *restrict new_points, double const *restrict new_points_value, double const *restrict new_points_noise_variance, int num_new_points)

Add the specified (point, fcn value, noise variance) historical data to this GP.

Forces recomputation of all derived quantities for GP to remain consistent.

Parameters:
new_points[dim][num_new_points]:
 coordinates of each new point to add
new_points_value[num_new_points]:
 function value at each new point
new_points_noise_variance[num_new_points]:
 sigma_n^2 corresponding to the signal noise in measuring new_points_value
num_new_points:number of new points to add to the GP

double SamplePointFromGP(double const *restrict point_to_sample, double noise_variance_this_point)

Sample a function value from a Gaussian Process prior, provided a point at which to sample.

Uses the formula function_value = gpp_mean + sqrt(gpp_variance) * w1 + sqrt(noise_variance) * w2, where w1, w2 are draws from \(\, N(0,1)\).

Note

Set noise_variance to 0 if you want “accurate” draws from the GP. BUT if the drawn (point, value) pair is meant to be added back into the GP (e.g., for testing), then this point MUST be drawn with noise_variance equal to the noise associated with “point” as a member of “points_sampled”

Parameters:
point_to_sample[dim]:
 coordinates of the point at which to generate a function value (from GP)
noise_variance_this_point:
 if this point is to be added into the GP, it needs to be generated with its associated noise var
Returns:
function value drawn from this GP

Samples function values from a GPP given a list of points.

Samples by: function_value = gpp_mean + sqrt(gpp_variance) * w, where w is a single draw from N(0,1).

We only draw one point at a time (i.e., num_to_sample fixed at 1). We want multiple draws from the same GPP; drawing many points per step would be akin to sampling multiple GPPs. Thus gpp_mean, gpp_variance, and w all have size 1.

If the GPP does not receive any data, then on the first step, gpp_mean = 0 and gpp_variance is just the “covariance” of a single point. Then we iterate through the remaining points in points_sampled, generating gpp_mean, gpp_variance, and a sample function value.
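
A minimal sketch of the sampling formula above (hypothetical helper; the library's SamplePointFromGP additionally computes gpp_mean and gpp_variance from its internal state and manages its own NormalRNG):

#include <cmath>
#include <random>

// One draw of a function value from the GP at a point, given the GP-predicted mean
// and variance there plus the measurement-noise variance to attach to the draw:
//   function_value = gpp_mean + sqrt(gpp_variance) * w1 + sqrt(noise_variance) * w2
double SampleFunctionValue(double gpp_mean, double gpp_variance,
                           double noise_variance, std::mt19937_64* generator) {
  std::normal_distribution<double> normal(0.0, 1.0);
  const double w1 = normal(*generator);
  const double w2 = normal(*generator);
  return gpp_mean + std::sqrt(gpp_variance) * w1 + std::sqrt(noise_variance) * w2;
}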

void ComputeMeanOfPoints(const StateType & points_to_sample_state, double *restrict mean_of_points)

Computes the mean of this GP at each of Xs (points_to_sample).

Note

points_to_sample should not contain duplicate points.

Note

comments are copied in Python: interfaces/gaussian_process_interface.py

Parameters:
points_to_sample_state:
 a FULLY CONFIGURED PointsToSampleState (configure via PointsToSampleState::SetupState)
Outputs:
mean_of_points[num_to_sample]:
 mean of GP, one per GP dimension

Calculates the mean (from the GPP) of a set of points:

mus = Ks^T * K^-1 * y

See Rasmussen and Williams, page 19, Algorithm 2.1

void ComputeGradMeanOfPoints(const StateType & points_to_sample_state, double *restrict grad_mu)

Computes the gradient of the mean of this GP at each of Xs (points_to_sample) wrt Xs.

Note

points_to_sample should not contain duplicate points.

Note that grad_mu is nominally sized: grad_mu[dim][num_to_sample][num_to_sample]. However, for 0 <= i,j < num_to_sample, i != j, grad_mu[d][i][j] = 0. (See references or implementation for further details.) Thus, grad_mu is stored in a reduced form which only tracks the nonzero entries.

Note

comments are copied in Python: interfaces/gaussian_process_interface.py

Parameters:
points_to_sample_state:
 a FULLY CONFIGURED PointsToSampleState (configure via PointsToSampleState::SetupState)
Outputs:
grad_mu[dim][state.num_derivatives]:
 gradient of the mean of the GP. grad_mu[d][i] is actually the gradient of \mu_i with respect to x_{d,i}, the d-th dimension of the i-th entry of points_to_sample.

Gradient of the mean of a GP. Note that the output storage skips known zeros (see declaration docs for details). See Scott Clark’s PhD thesis for more spelled out mathematical details, but this is a reasonably straightforward differentiation of:

mus = Ks^T * K^-1 * y

wrt Xs (so only Ks contributes derivative terms)
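
Written out in the index notation of Section 4 (a restatement of the above; since only Ks depends on Xs, the solve \(\, K^{-1} y\) from the mean computation can be reused):

\[\frac{\partial \mu_i}{\partial Xs_{d,i}} = \sum_k \frac{\partial Ks_{k,i}}{\partial Xs_{d,i}} \, \left[K^{-1} y\right]_k\]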

void ComputeVarianceOfPoints(StateType * points_to_sample_state, double *restrict var_star)

Computes the variance (matrix) of this GP at each point of Xs (points_to_sample).

The variance matrix is symmetric (in fact, SPD) and is stored in the LOWER TRIANGLE.

Note

points_to_sample should not contain duplicate points.

Note

comments are copied in Python: interfaces/gaussian_process_interface.py

Parameters:
points_to_sample_state[1]:
 ptr to a FULLY CONFIGURED PointsToSampleState (configure via PointsToSampleState::SetupState)
Outputs:
points_to_sample_state[1]:
 ptr to a FULLY CONFIGURED PointsToSampleState; only temporary state may be mutated
var_star[num_to_sample][num_to_sample]:
 variance of GP evaluated at points_to_sample

Mathematically, we are computing Vars (Var_star), the GP variance. Vars is defined at the top of this file (Equation 3) and in Rasmussen & Williams, Equation 2.19:

L * L^T = K
V = L^-1 * Ks
Vars = Kss - (V^T * V)

This quantity is:

Kss: the covariance between test points based on the prior distribution

minus

V^T * V: the information observations give us about the objective function

Notice that Vars is clearly symmetric. Kss is SPD. And V^T * V = (V^T * V)^T is symmetric (and is in fact SPD).

V^T * V = Ks^T * K^-1 * Ks is SPD because:

X^T * A * X is SPD when A is SPD AND X has full rank (X need not be square)

Ks has full rank as long as K & Kss are SPD; K^-1 is SPD because K is SPD.

It turns out that Vars is SPD.

In Equation 1 (Rasmussen & Williams 2.18), it is clear that the combined covariance matrix is SPD (as long as no duplicate points and the covariance function is valid). A matrix of the form:

[ A   B ]
[ B^T C ]

is SPD if and only if A is SPD AND (C - B^T * A^-1 * B) is SPD. Here, A = K, B = Ks, C = Kss. This (aka Schur Complement) can be shown readily:

[ A   B ] = [  I            0 ] * [ A            0           ] * [ I   A^-1 * B ]
[ B^T C ]   [ (A^-1 * B)^T  I ]   [ 0   C - B^T * A^-1 * B   ]   [ 0       I    ]

This factorization is valid because A is SPD (and thus invertible). Then by the X^T * A * X rule for SPD-ness, we know the block-diagonal matrix in the center is SPD. Hence the SPD-ness of Vars = Kss - V^T * V follows readily.

For more information, see: http://en.wikipedia.org/wiki/Schur_complement

void ComputeGradVarianceOfPoints(StateType * points_to_sample_state, double *restrict grad_var)

Similar to ComputeGradCholeskyVarianceOfPoints() except this does not include the gradient terms from the cholesky factorization. Description will not be duplicated here.

This is just a thin wrapper that calls ComputeGradVarianceOfPointsPerPoint() in a loop num_derivatives times.

See ComputeGradVarianceOfPointsPerPoint()’s function comments and implementation for more mathematical details on the derivation, algorithm, optimizations, etc.

void ComputeGradCholeskyVarianceOfPoints(StateType * points_to_sample_state, double const *restrict chol_var, double *restrict grad_chol)

Computes the gradient of the cholesky factorization of the variance of this GP with respect to points_to_sample. This function accounts for the effect on the gradient resulting from cholesky-factoring the variance matrix. See Smith 1995 for algorithm details.

points_to_sample is not allowed to contain duplicate points. Violating this results in a singular variance matrix.

Note that grad_chol is nominally sized:

grad_chol[dim][num_to_sample][num_to_sample][num_to_sample].

Let this be indexed grad_chol[d][i][j][k], which is read as the derivative of var[i][j] with respect to x_{d,k} (x = points_to_sample)

Note

comments are copied in Python: interfaces/gaussian_process_interface.py

Parameters:
points_to_sample_state[1]:
 ptr to a FULLY CONFIGURED PointsToSampleState (configure via PointsToSampleState::SetupState)
chol_var[num_to_sample][num_to_sample]:
 the variance (matrix) of this GP at each point of Xs (points_to_sample) e.g., from the cholesky factorization of ComputeVarianceOfPoints
Outputs:
points_to_sample_state[1]:
 ptr to a FULLY CONFIGURED PointsToSampleState; only temporary state may be mutated
grad_chol[dim][num_to_sample][num_to_sample][state->num_derivatives]:
 gradient of the cholesky-factored variance of the GP. grad_chol[d][i][j][k] is actually the gradient of var_{i,j} with respect to x_{d,k}, the d-th dimension of the k-th entry of points_to_sample

This is just a thin wrapper that calls ComputeGradCholeskyVarianceOfPointsPerPoint() in a loop num_derivatives times.

See ComputeGradCholeskyVarianceOfPointsPerPoint()’s function comments and implementation for more mathematical details on the algorithm.

void SetExplicitSeed(EngineType::result_type seed)

Seed the random number generator with the specified seed. See gpp_random, struct NormalRNG for details.

Parameters:
seed:new seed to set

void SetRandomizedSeed(EngineType::result_type seed)

Seed the random number generator using a combination of the specified seed, current time, and potentially other factors. See gpp_random, struct NormalRNG for details.

Parameters:
seed:base value for new seed

void ResetToMostRecentSeed()

Seeds the generator with its last used seed value. Useful for testing–e.g., can conduct multiple runs with the same initial conditions

GaussianProcess * Clone()

Clones “this” GaussianProcess.

Returns:
Pointer to a constructed object that is a copy of “this”

OL_DISALLOW_DEFAULT_AND_ASSIGN(GaussianProcess)

Public Static Attributes

constexpr EngineType::result_type kDefaultSeed

Default seed value to make reproducing test results simple.

constexpr double kMinimumStdDev

Minimum allowed standard deviation value in ComputeGradCholeskyVarianceOfPointsPerPoint (= machine precision). Values that are too small result in problems b/c we may compute std_dev/var (which is enormous if std_dev = 1.0e-150 and var = 1.0e-300) since this only arises when we fail to compute std_dev = var = 0.0. Note: this is only relevant if noise = 0.0; this minimum will not affect GPs with noise since this value is below the smallest amount of noise users can meaningfully add. This value was chosen to be consistent with the singularity condition in CholeskyFactorL and tested for robustness with the setup in EIOnePotentialSampleEdgeCasesTest().

Protected Functions

GaussianProcess(const GaussianProcess & source)

Private Functions

void BuildCovarianceMatrixWithNoiseVariance()

void BuildMixCovarianceMatrix(double const *restrict points_to_sample, int num_to_sample, double *restrict cov_mat)

void ComputeGradVarianceOfPointsPerPoint(StateType * points_to_sample_state, int diff_index, double *restrict grad_var)

Similar to ComputeGradCholeskyVarianceOfPointsPerPoint() except this does not include the gradient terms from the cholesky factorization. Description will not be duplicated here.

CORE IDEA

Similar to ComputeGradCholeskyVarianceOfPoints() below, except this function does not account for the cholesky decomposition. That is, it produces derivatives wrt Xs_{d,p} (points_to_sample) of:

Vars = Kss - (V^T * V) = Kss - Ks^T * K^-1 * Ks (see ComputeVarianceOfPoints)

Note

normally Xs_p would be the p-th point of Xs (all dimensions); here Xs_{d,p} more explicitly refers to the d-th spatial dimension of the p-th point.

This function only returns the derivative wrt a single choice of p, as specified by diff_index.

Expanded index notation:

Vars_{i,j} = Kss_{i,j} - Ks^T_{i,l} * K^-1_{l,k} * Ks_{k,j}

Recall Ks_{k,i} = cov(X_k, Xs_i) = cov(Xs_i, X_k) where Xs is points_to_sample and X is points_sampled. (Note this is not equivalent to saying Ks = Ks^T, although this would be true if |Xs| == |X|.) As a result of this symmetry, \pderiv{Ks_{k,i}}{Xs_{d,i}} = \pderiv{Ks_{i,k}}{Xs_{d,i}} (that’s d(cov(Xs_i, X_k))/d(Xs_i))

We are being more strict with index labels than is standard to clearly specify tensor dimensions. To be clear: 1. i,j range over num_to_sample 2. l,k are the only non-free indices; they range over num_sampled 3. d,p describe the SPECIFIC point being differentiated against in Xs (points_to_sample): d over dimension, p* over num_to_sample

*NOTE: p is fixed! Unlike all other indices, p refers to a SPECIFIC point in the range [0, ..., num_to_sample-1].
Thus, \pderiv{Ks_{k,i}}{Xs_{d,i}} is a 3-tensor (A_{d,k,i}) (repeated i is not summation since they denote components of a derivative) while \pderiv{Ks_{i,l}}{Xs_{d,p}} is a 2-tensor (A_{d,l}) b/c only \pderiv{Ks_{i=p,l}}{Xs_{d,p}} is nonzero, and {d,l} are the only remaining free indices.

Then differentiating against Xs_{d,p} (recall that this is a specific point b/c p is fixed):

\pderiv{Vars_{i,j}}{Xs_{d,p}} = \pderiv{Kss_{i,j}}{Xs_{d,p}} -
(\pderiv{Ks_{i,l}}{Xs_{d,p}} * K^-1_{l,k} * Ks_{k,j}   +  Ks_{i,l} * K^-1_{l,k} * \pderiv{Ks_{k,j}}{Xs_{d,p}})

Many of these terms are analytically known to be 0: \pderiv{Ks_{i,l}}{Xs_{d,p}} = 0 when p != i (see NOTE above). A similar statement holds for the other gradient term.

Observe that the second term in the parens, Ks_{i,l} * K^-1_{l,k} * \pderiv{Ks_{k,j}}{Xs_{d,p}}, can be reordered to “look” like the first term. We use three symmetries: K^-1_{l,k} = K^-1_{k,l}, Ks_{i,l} = Ks_{l,i}, and

\pderiv{Ks_{k,j}}{Xs_{d,p}} = \pderiv{Ks_{j,k}}{Xs_{d,p}}

Then we can write:

Ks_{i,l} * K^-1_{l,k} * \pderiv{Ks_{k,j}}{Xs_{d,p}} = \pderiv{Ks_{j,k}}{Xs_{d,p}} * K^-1_{k,l} * Ks_{l,i}

Now left and right terms have the same index ordering (i,j match; k,l are not free and thus immaterial)

The final result, accounting for analytic zeros is given here for convenience:

DVars_{d,i,j} \equiv \pderiv{Vars_{i,j}}{Xs_{d,p}} =
  { \pderiv{Kss_{i,j}}{Xs_{d,p}} - 2*\pderiv{Ks_{i,l}}{Xs_{d,p}} * K^-1_{l,k} * Ks_{k,j}   :  WHEN p == i == j
  { \pderiv{Kss_{i,j}}{Xs_{d,p}} -   \pderiv{Ks_{i,l}}{Xs_{d,p}} * K^-1_{l,k} * Ks_{k,j}   :  WHEN p == i != j
  { \pderiv{Kss_{i,j}}{Xs_{d,p}} -   \pderiv{Ks_{j,k}}{Xs_{d,p}} * K^-1_{k,l} * Ks_{l,i}   :  WHEN p == j != i
  {                                    0                                                   :  otherwise

The first item has a factor of 2 b/c it gets a contribution from both parts of the sum since p == i and p == j. The ordering DVars_{d,i,j} is significant: this is the ordering (d changes the fastest) in storage.

OPTIMIZATIONS

Implementing this formula naively results in a large amount of redundant computation, so we now describe the optimizations present in our implementation.

The first thing to notice is that the result, \pderiv{Vars_{i,j}}{Xs_{d,p}}, has a lot of 0s. In particular, only the p-th block row and p-th block column have nonzero entries (blocks are size dim, indexed d). Currently, we will not be taking advantage of this sparsity because the consumer of DVars, ComputeGradCholeskyVarianceOfPoints(), is not implemented with sparsity in mind.

Next, notice that if we ignore the case p == i == j, the expressions for p == i and p == j are actually identical (e.g., take the p == j case and swap the index labels i ↔ j and k ↔ l).

So think of DVars as a block matrix; each block holds dim entries (indexed by d), and the blocks are indexed over i (rows), j (cols). Then we see that the result is block-symmetric: DVars_{d,i,j} = DVars_{d,j,i}. So we can compute it by filling in the p-th block column and then copying that data into the p-th block row.

Additionally, the derivative terms represent matrix-matrix products: C_{l,j} = K^-1_{l,k} * Ks_{k,j} (and K^-1_{k,l} * Ks_{l,i}, which is just a change of index labels) is a matrix product. We compute this using back-substitutions to avoid explicitly forming K^-1. C_{l,j} is num_sampled X num_to_sample.

Then D_{d,i=p,j} = \pderiv{Ks_{i=p,l}}{Xs_{d,p}} * C_{l,j} is another matrix product (result size dim * num_to_sample) (i = p indicates that index i collapses out since this deriv term is zero if p != i). Note that we store \pderiv{Ks_{i=p,l}}{Xs_{d,p}} = \pderiv{Ks_{l,i=p}}{Xs_{d,p}} as A_{d,l,i} and grab the i = p-th block.

Again, only the p-th point of points_to_sample is differentiated against; p is specified by diff_index.
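
To make the back-substitution step above concrete, here is a minimal, hypothetical sketch (not the library’s routine) of forming C = K^-1 * Ks one column at a time from the cholesky factor L of K (K = L * L^T), using the column-major storage convention used throughout this file:

// Hypothetical helper (not part of gpp_math): form C = K^-1 * Ks without building K^-1.
// L is the lower-triangular cholesky factor of K (num_sampled x num_sampled, column-major).
// C enters holding a copy of Ks (num_sampled x num_to_sample, column-major) and is
// overwritten in place, one column at a time, via a forward and a backward substitution.
void SolveFromCholeskyFactor(const double * L, int num_sampled, int num_to_sample, double * C) {
  for (int j = 0; j < num_to_sample; ++j) {
    double * col = C + j*num_sampled;
    // forward substitution: L * z = col (z overwrites col)
    for (int i = 0; i < num_sampled; ++i) {
      double sum = col[i];
      for (int k = 0; k < i; ++k) {
        sum -= L[k*num_sampled + i] * col[k];  // L_{i,k} in column-major storage
      }
      col[i] = sum / L[i*num_sampled + i];
    }
    // backward substitution: L^T * x = z (x overwrites col)
    for (int i = num_sampled - 1; i >= 0; --i) {
      double sum = col[i];
      for (int k = i + 1; k < num_sampled; ++k) {
        sum -= L[i*num_sampled + k] * col[k];  // (L^T)_{i,k} = L_{k,i} in column-major storage
      }
      col[i] = sum / L[i*num_sampled + i];
    }
  }
}

Since the cholesky factorization of K is tracked by the GaussianProcess (the K_chol_ member described below), a solve like this never needs to re-factor K.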

void ComputeGradCholeskyVarianceOfPointsPerPoint(StateType * points_to_sample_state, int diff_index, double const *restrict chol_var, double *restrict grad_chol)

Computes the gradient of the cholesky factorization of the variance of this GP with respect to the diff_index-th point in points_to_sample.

This internal method is meant to be used by ComputeGradCholeskyVarianceOfPoints() to construct the gradient wrt all points of points_to_sample. See that function for more details.

Parameters:
points_to_sample_state[1]:
 ptr to a FULLY CONFIGURED PointsToSampleState (configure via PointsToSampleState::SetupState)
diff_index:index of points_to_sample in {0, ..., num_to_sample-1} to be differentiated against
chol_var[num_to_sample][num_to_sample]:
 the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample), e.g., the cholesky factor of the output of ComputeVarianceOfPoints
Outputs:
points_to_sample_state[1]:
 ptr to a FULLY CONFIGURED PointsToSampleState; only temporary state may be mutated
grad_chol[dim][num_to_sample][num_to_sample]:
 gradient of the cholesky-factored variance of the GP. grad_chol[d][i][j] is actually the gradient of var_{i,j} with respect to x_{d,k}, the d-th dimension of the k-th entry of points_to_sample, where k = diff_index

Differentiates the cholesky factorization of the GP variance.

Vars = Kss - (V^T * V) (see ComputeVarianceOfPoints)
C * C^T = Vars

This function differentiates C wrt the p-th point of points_to_sample; p is specified by diff_index.

Just as users of a lower triangular matrix L[i][j] should not access the upper triangle (j > i), users of the result of this function, grad_chol[d][i][j], should not access the upper block triangle with j > i.

See Smith 1995 for full details of computing gradients of the cholesky factorization

void RecomputeDerivedVariables()

Recomputes (including resizing as needed) the derived quantities in this class. This function should be called any time state variables are changed.

Private Members

int dim_

spatial dimension (e.g., entries per point of points_sampled)

int num_sampled_

number of points in points_sampled

std::unique_ptr< CovarianceInterface > covariance_ptr_

covariance class (for computing covariance and its gradients)

std::vector< double > points_sampled_

coordinates of already-sampled points, X

std::vector< double > points_sampled_value_

function values at points_sampled, y

std::vector< double > noise_variance_

\sigma_n^2, the noise variance

std::vector< double > K_chol_

cholesky factorization of K (i.e., K(X,X) covariance matrix (prior), includes noise variance)

std::vector< double > K_inv_y_

K^-1 * y; computed WITHOUT forming K^-1

NormalGeneratorType normal_rng_

Normal PRNG for use with sampling points from GP.

class OnePotentialSampleExpectedImprovementEvaluator

This is a specialization of the ExpectedImprovementEvaluator class for the case where the number of potential samples is 1 (i.e., num_to_sample == 1) and the number of concurrent samples is 0 (i.e., num_being_sampled == 0). In other words, this class only supports the computation of 1,0-EI. In this case, we have analytic formulas for computing EI and its gradient.

Thus this class does not perform any explicit numerical integration, nor do its EI functions require access to a random number generator.

This class’s methods have some parameters that are unused or redundant. This is so that the interface matches that of the more general ExpectedImprovementEvaluator.

For other details, see ExpectedImprovementEvaluator for a more complete description of what EI is and the outputs of EI and grad EI computations.

Public Type
Public Functions

OnePotentialSampleExpectedImprovementEvaluator(const GaussianProcess & gaussian_process_in, double best_so_far)

Constructs a OnePotentialSampleExpectedImprovementEvaluator object. All inputs are required; no default constructor nor copy/assignment are allowed.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
best_so_far:best (minimum) objective function value (in points_sampled_value)

int dim()

const GaussianProcess * gaussian_process()

double ComputeObjectiveFunction(StateType * ei_state)

Wrapper for ComputeExpectedImprovement(); see that function for details.

void ComputeGradObjectiveFunction(StateType * ei_state, double *restrict grad_EI)

Wrapper for ComputeGradExpectedImprovement(); see that function for details.

double ComputeExpectedImprovement(StateType * ei_state)

Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+]

Uses analytic formulas to evaluate the expected improvement.

Parameters:
ei_state[1]:properly configured state object
Outputs:
ei_state[1]:state with temporary storage modified
Returns:
the expected improvement from sampling point_to_sample

Uses analytic formulas to compute EI when num_to_sample = 1 and num_being_sampled = 0 (occurs only in 1,0-EI). In this case, the single-parameter (posterior) GP is just a Gaussian. So the integral in EI (previously eval’d with MC) can be computed ‘exactly’ using high-accuracy routines for the pdf & cdf of a Gaussian random variable.

See Ginsbourger, Le Riche, and Carraro.
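
For reference, a minimal sketch of the standard closed form is below (a hypothetical standalone helper, not this class’s implementation, which additionally applies the kMinimumVarianceEI safeguard; mu and sigma denote the GP posterior mean and standard deviation at point_to_sample):

#include <boost/math/distributions/normal.hpp>

// Standard analytic 1,0-EI (minimization): with z = (best_so_far - mu)/sigma,
//   EI = (best_so_far - mu) * Phi(z) + sigma * phi(z)
// where Phi/phi are the standard normal cdf/pdf.
double AnalyticExpectedImprovement(double mu, double sigma, double best_so_far) {
  const boost::math::normal_distribution<double> standard_normal(0.0, 1.0);
  const double z = (best_so_far - mu) / sigma;
  return (best_so_far - mu)*boost::math::cdf(standard_normal, z) + sigma*boost::math::pdf(standard_normal, z);
}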

void ComputeGradExpectedImprovement(StateType * ei_state, double *restrict grad_EI)

Computes the (partial) derivatives of the expected improvement with respect to the point to sample.

Uses analytic formulas to evaluate the spatial gradient of the expected improvement.

Parameters:
ei_state[1]:properly configured state object
Outputs:
ei_state[1]:state with temporary storage modified
grad_EI[dim]:gradient of EI, \pderiv{EI(x)}{x_d}, where x is points_to_sample

Differentiates OnePotentialSampleExpectedImprovementEvaluator::ComputeExpectedImprovement wrt points_to_sample (which is just ONE point; i.e., 1,0-EI). Again, this uses analytic formulas in terms of the pdf & cdf of a Gaussian since the integral in EI (and grad EI) can be evaluated exactly for this low dimensional case.

See Ginsbourger, Le Riche, and Carraro.
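
For reference, the corresponding closed form is the standard result (same mu, sigma, z as in the EI sketch above; \pderiv{sigma}{x_d} comes from the gradient of the cholesky-factored variance tracked in the state object):

\pderiv{EI(x)}{x_d} = -\pderiv{mu(x)}{x_d} * Phi(z) + \pderiv{sigma(x)}{x_d} * phi(z),   where z = (best_so_far - mu(x))/sigma(x)

(the terms generated by differentiating z cancel, since phi'(z) = -z*phi(z) and sigma*z = best_so_far - mu).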

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(OnePotentialSampleExpectedImprovementEvaluator)

Public Static Attributes

constexpr double kMinimumVarianceEI

Minimum allowed variance value in the “1D” analytic EI computation. Values that are too small cause problems because we may end up computing std_dev/var (which is enormous if, e.g., std_dev = 1.0e-150 and var = 1.0e-300); this situation only arises when std_dev and var should both be exactly 0.0 but are not computed as such. Note: this is only relevant if noise = 0.0; this minimum will not affect EI computation with noise since this value is below the smallest amount of noise users can meaningfully add. This is the smallest possible value that prevents the denominator of (best_so_far - mean) / sqrt(variance) from being 0. 1D analytic EI is simple, and no other robustness considerations are needed.

constexpr double kMinimumVarianceGradEI

Minimum allowed variance value in the “1D” analytic grad EI computation. See kMinimumVarianceEI for more details. This value was chosen so its sqrt would be a little larger than GaussianProcess::kMinimumStdDev (by ~12x). The 150.0 was determined by numerical experiment with the setup in EIOnePotentialSampleEdgeCasesTest in order to find a setting that would be robust (no 0/0) while introducing minimal error.

Private Members

const int dim_

spatial dimension (e.g., entries per point of points_sampled)

double best_so_far_

best (minimum) objective function value (in points_sampled_value)

const boost::math::normal_distribution< double > normal_

normal distribution object

const GaussianProcess * gaussian_process_

pointer to gaussian process used in EI computations

class OnePotentialSampleExpectedImprovementState

State object for OnePotentialSampleExpectedImprovementEvaluator. This tracks the ONE point_to_sample being evaluated via expected improvement.

This is just a special case of ExpectedImprovementState; see those class docs for more details. See general comments on State structs in gpp_common.hpp’s header docs.

Public Type
Public Functions

OnePotentialSampleExpectedImprovementState(const EvaluatorType & ei_evaluator, double const *restrict point_to_sample_in, bool configure_for_gradients)

Constructs a OnePotentialSampleExpectedImprovementState object for the purpose of computing EI (and its gradient) at the specified point to sample. This establishes properly sized/initialized temporaries for EI computation, including dependent state from the associated Gaussian Process (which arrives as part of the ei_evaluator).

Warning

This object is invalidated if the associated ei_evaluator is mutated. SetupState() should be called to reset.

Warning

Using this object to compute gradients when configure_for_gradients := false results in UNDEFINED BEHAVIOR.

Parameters:
ei_evaluator:expected improvement evaluator object that specifies the parameters & GP for EI evaluation
point_to_sample[dim]:
 point at which to evaluate EI and/or its gradient (e.g., to check its value in future experiments); i.e., the test point for GP predictions
configure_for_gradients:
 true if this object will be used to compute gradients, false otherwise

OnePotentialSampleExpectedImprovementState(const EvaluatorType & ei_evaluator, double const *restrict points_to_sample, double const *restrict OL_UNUSED, int OL_UNUSED, int OL_UNUSED, bool configure_for_gradients, NormalRNGInterface * OL_UNUSED)

Constructor wrapper to match the signature of the ctor for ExpectedImprovementState().

OnePotentialSampleExpectedImprovementState(OnePotentialSampleExpectedImprovementState && other)

int GetProblemSize()

void GetCurrentPoint(double *restrict point_to_sample_out)

Get point_to_sample: the potential future sample whose EI (and/or gradients) is being evaluated

Outputs:
point_to_sample[dim]:
 potential sample whose EI is being evaluated

void SetCurrentPoint(const EvaluatorType & ei_evaluator, double const *restrict point_to_sample_in)

Change the potential sample whose EI (and/or gradient) is being evaluated. Update the state’s derived quantities to be consistent with the new point.

Parameters:
ei_evaluator:expected improvement evaluator object that specifies the parameters & GP for EI evaluation
point_to_sample[dim]:
 potential future sample whose EI (and/or gradients) is being evaluated

void SetupState(const EvaluatorType & ei_evaluator, double const *restrict point_to_sample_in)

Configures this state object with a new point_to_sample, the location of the potential sample whose EI is to be evaluated. Ensures all state variables & temporaries are properly sized. Properly sets all dependent state variables (e.g., GaussianProcess’s state) for EI evaluation.

Warning

This object’s state is INVALIDATED if the ei_evaluator (including the GaussianProcess it depends on) used in SetupState is mutated! SetupState() should be called again in such a situation.

Parameters:
ei_evaluator:expected improvement evaluator object that specifies the parameters & GP for EI evaluation
point_to_sample[dim]:
 potential future sample whose EI (and/or gradients) is being evaluated

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(OnePotentialSampleExpectedImprovementState)

Public Members

const int dim

spatial dimension (e.g., entries per point of points_sampled)

const int num_to_sample

number of points to sample (i.e., the “q” in q,p-EI); MUST be 1

const int num_derivatives

number of derivative terms desired (usually 0 for no derivatives, or num_to_sample)

std::vector< double > point_to_sample

point at which to evaluate EI and/or its gradient (e.g., to check its value in future experiments)

GaussianProcess::StateType points_to_sample_state

gaussian process state

std::vector< double > grad_mu

the gradient of the GP mean evaluated at point_to_sample, wrt point_to_sample

std::vector< double > grad_chol_decomp

the gradient of the sqrt of the GP variance evaluated at point_to_sample wrt point_to_sample

class PointsToSampleState

This object holds the state needed for a GaussianProcess object to characterize the distribution of function values arising from sampling the GP at a list of points_to_sample. This object is required by the GaussianProcess to access functionality for computing the mean, variance, and spatial gradients thereof.

The “independent variables” for this object are points_to_sample. These points are both the “p” and the “q” in q,p-EI; i.e., they are the parameters of both ongoing experiments and new predictions. Recall that in q,p-EI, the q points are called points_to_sample and the p points are called points_being_sampled. Here, we need to make predictions about both point sets with the GP, so we simply call the union of point sets points_to_sample.

In GP computations, there is really no distinction between the “q” and “p” points from EI, points_to_sample and points_being_sampled, respectively. However, in EI optimization, we only need gradients of GP quantities wrt points_to_sample, so users should build PointsToSampleState() with num_derivatives = num_to_sample.

Once constructed, this object provides the SetupState() function to update it for computations at different sets of potential points to sample.

See general comments on State structs in gpp_common.hpp’s header docs.

Public Functions

PointsToSampleState(const GaussianProcess & gaussian_process, double const *restrict points_to_sample_in, int num_to_sample_in, int num_derivatives_in)

Constructs a PointsToSampleState object with new points_to_sample. Ensures all state variables & temporaries are properly sized. Properly sets all state variables so that GaussianProcess’s mean, variance (and gradients thereof) functions can be called.

Warning

This object’s state is INVALIDATED if the gaussian_process used in construction is mutated! SetupState() should be called again in such a situation.

Warning

Using this object to compute gradients when num_derivatives := 0 results in UNDEFINED BEHAVIOR.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
points_to_sample[dim][num_to_sample]:
 points at which to compute GP-derived quantities (mean, variance, etc.)
num_to_sample:number of points being sampled concurrently
num_derivatives:
 configure this object to compute num_derivatives derivative terms wrt points_to_sample[:][0:num_derivatives]; 0 means no gradient computation will be performed.

PointsToSampleState(PointsToSampleState && other)

void SetupState(const GaussianProcess & gaussian_process, double const *restrict points_to_sample_in, int num_to_sample_in, int num_derivatives_in)

Configures this object with new points_to_sample. Ensures all state variables & temporaries are properly sized. Properly sets all state variables so that GaussianProcess’s mean, variance (and gradients thereof) functions can be called.

Warning

This object’s state is INVALIDATED if the gaussian_process used in SetupState is mutated! SetupState() should be called again in such a situation.

Warning

Using this object to compute gradients when num_derivatives := 0 results in UNDEFINED BEHAVIOR.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
points_to_sample[dim][num_to_sample]:
 points at which to compute GP-derived quantities (mean, variance, etc.)
num_to_sample:number of points being sampled concurrently
num_derivatives:
 configure this object to compute num_derivatives derivative terms wrt points_to_sample[:][0:num_derivatives]; 0 means no gradient computation will be performed.

OL_DISALLOW_DEFAULT_AND_COPY_AND_ASSIGN(PointsToSampleState)

Public Members

const int dim

spatial dimension (e.g., entries per point of points_sampled)

int num_sampled

number of points already sampled

int num_to_sample

number of points currently being sampled

int num_derivatives

this object can compute num_derivatives derivative terms wrt points_to_sample[:][0:num_derivatives]; 0 means no gradient computation will be performed

std::vector< double > points_to_sample

points to make predictions about, Xs

std::vector< double > K_star

the “mixed” covariance matrix Ks: Ks_{i,j} = K(X,Xs)_{i,j} = covariance(X_i, Xs_j); the covariance matrix between training and test inputs (num_sampled x num_to_sample)

std::vector< double > grad_K_star

the gradient of mixed covariance matrix, Ks, wrt Xs

std::vector< double > V

the variance matrix (output from the GP)

std::vector< double > K_inv_times_K_star

K^{-1} Ks (computed without taking an inverse)

std::vector< double > grad_cov

the gradient of covariance(x_1, x_2) wrt x_1

gpp_math.cpp

These comments are getting to be of some length, so here’s a table of contents:

  1. FILE OVERVIEW
  2. IMPLEMENTATION NOTES
  3. MATHEMATICAL OVERVIEW
    1. GAUSSIAN PROCESSES
    2. SAMPLING FROM GPs
    3. EXPECTED IMPROVEMENT
  4. CODE DESIGN/LAYOUT OVERVIEW:
    1. class GaussianProcess
    2. class ExpectedImprovementEvaluator, OnePotentialSampleExpectedImprovementEvaluator
    3. function ComputeOptimalPointsToSampleWithRandomStarts()
  5. CODE HIERARCHY / CALL-TREE

1. FILE OVERVIEW

Implementations of functions for Gaussian Processes (mean, variance of GPs and their gradients) and for computing and optimizing Expected Improvement (EI).

2. IMPLEMENTATION NOTES

See gpp_math.hpp file docs and gpp_common.hpp for a few important implementation notes (e.g., restrict, memory allocation, matrix storage style, etc), as well as citation details.

Additionally, the matrix looping idioms used in this file deserve further mention: see gpp_common.hpp header comments, item 7 for further details. In summary, using matrix-vector-multiply as an example, we do:

for (int i = 0; i < m; ++i) {
  y[i] = 0;
  for (int j = 0; j < n; ++j) {
    y[i] += A[j]*x[j];
  }
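  // advance A past the n entries just consumed (pointer-walking idiom; see gpp_common.hpp, item 7)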
  A += n;
}

3. MATHEMATICAL OVERVIEW

Next, we provide a high-level discussion of GPs and the EI optimization process used in this file. See Rasmussen & Williams for more details on the former and Scott Clark’s thesis for details on the latter. This segment is more focused on concepts and mathematical ideas. We subsequently discuss how the classes and functions in this file map onto these mathematical concepts. If it wasn’t already clear, please read the file comments for gpp_math.hpp (a conceptual overview) before continuing.

3a. GAUSSIAN PROCESSES

First, a Gaussian Process (GP) is defined as a collection of normally distributed random variables (RVs); in general these RVs are neither independent nor identically distributed (i.e., all normal but with different mean/var). Since the GP is a collection of RVs, it defines a distribution over FUNCTIONS. So drawing from the GP realizes one particular function.

Now let X = training data; these are our experimental independent variables.
Let f = training data observed values; this is our (SCALAR) dependent variable.
So for (X_i, f_i) pairs, we say:

f ~ GP(0, cov(X,X)) \equiv N(0, cov(X,X))

the training data, f, is distributed like a (multi-variate) Gaussian with mean 0 and variance = cov(X,X). Drawing from this GP requires conditioning on the result satisfying the training data. That is, the realized function must pass through all points (X,f). Between these, “essentially any” behavior is possible, although certain behaviors are more likely as specified via cov(X,X). Note that the GP has 0 mean (and no signal variance) to specify that it passes through X,f exactly. Nonzero mean would shift the entire distribution so that it passes through (X,f+mu).

In the following, K(X,X) is the covariance function. It’s given as an input to this whole process and is critical in informing the behavior of the GP. The covariance function describes how related we (a priori) believe prior points are to each other. In code, the covariance function is specified through the CovarianceInterface class.
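
As a concrete illustration (a hypothetical example, not necessarily the covariance the library is configured with), a squared-exponential kernel could be used to fill K(X,X):

#include <cmath>
#include <vector>

// Hypothetical squared-exponential covariance: cov(x, x') = alpha * exp(-0.5 * ||x - x'||^2 / l^2).
// Fills the num_sampled x num_sampled matrix K(X,X) in column-major order; points_sampled is
// stored as [dim][num_sampled] (dimension changes fastest), matching the conventions in this file.
void BuildSquaredExponentialCovarianceMatrix(const std::vector<double>& points_sampled, int dim,
                                             int num_sampled, double alpha, double length_scale,
                                             std::vector<double>* cov_matrix) {
  cov_matrix->assign(num_sampled*num_sampled, 0.0);
  for (int j = 0; j < num_sampled; ++j) {
    for (int i = 0; i < num_sampled; ++i) {
      double squared_distance = 0.0;
      for (int d = 0; d < dim; ++d) {
        const double diff = points_sampled[i*dim + d] - points_sampled[j*dim + d];
        squared_distance += diff*diff;
      }
      (*cov_matrix)[j*num_sampled + i] = alpha*std::exp(-0.5*squared_distance/(length_scale*length_scale));
    }
  }
}

Hyperparameters like alpha and length_scale here stand in for whatever the chosen CovarianceInterface implementation exposes; changing them is what GaussianProcess’s hyperparameter interface is for.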

In a noise-free setting (signal noise modifies K to become K + \sigma^2 * Id, Id being identity), the joint distribution of training outputs, f, and test outputs, fs, is:

[ f  ]  ~ N( 0, [ K(X,X)   K(X,Xs)  ]  = [ K     Ks  ]         (Equation 1, Rasmussen & Williams 2.18)
[ fs ]          [ K(Xs,X)  K(Xs,Xs) ]    [ Ks^T  Kss ]

where the test outputs are drawn from the prior.

K(X,X) and K(Xs,Xs) are computed in BuildCovarianceMatrix()
K(X,Xs) is computed by BuildMixCovarianceMatrix(); and K(Xs,X) is its transpose.
K + \sigma^2 is computed in BuildCovarianceMatrixWithNoiseVariance(); almost all practical uses of GPs and EI will be over data with nonzero noise variance. However, this is immaterial to the rest of the discussion here.

3b. SAMPLING FROM GPs

So to obtain the posterior distribution, fs, we again sample this joint prior and throw out any function realizations that do not satisfy the observations (i.e., pass through all (X,f) pairs). This is expensive.

Instead, we can compute the posterior directly by conditioning the joint prior on the observations:

fs | Xs,X,f ~ N( mus, Vars)

where mus = K(Xs,X) * K(X,X)^-1 * f = Ks^T * K^-1 * f,  (Equation 2, Rasmussen & Williams 2.19) which is computed in GaussianProcess::ComputeMeanOfPoints.

and Vars = K(Xs,Xs) - K(Xs,X) * K(X,X)^-1 * K(X,Xs) = Kss - Ks^T * K^-1 * Ks, (Equation 3, Rasmussen & Williams 2.19) which is implemented in GaussianProcess::ComputeVarianceOfPoints (and provably SPD).
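
A minimal sketch of Equation 2 follows (a hypothetical standalone helper, not ComputeMeanOfPoints itself). Note that K^-1 * f corresponds to the precomputed K_inv_y_ member of GaussianProcess, so the posterior mean reduces to the matrix product Ks^T * (K^-1 * f):

#include <vector>

// Hypothetical helper: mus_i = sum_k Ks_{k,i} * (K^-1 * f)_k, i.e. mus = Ks^T * K^-1 * f.
// Ks is num_sampled x num_to_sample, column-major; K_inv_f has length num_sampled.
std::vector<double> PosteriorMean(const std::vector<double>& Ks, const std::vector<double>& K_inv_f,
                                  int num_sampled, int num_to_sample) {
  std::vector<double> mus(num_to_sample, 0.0);
  for (int i = 0; i < num_to_sample; ++i) {
    for (int k = 0; k < num_sampled; ++k) {
      mus[i] += Ks[i*num_sampled + k] * K_inv_f[k];  // Ks_{k,i} in column-major storage
    }
  }
  return mus;
}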

Now we can draw from this multi-variate Gaussian by:

y = mus + L * w    (Equation 4)

where L * L^T = Vars (cholesky factorization) and w is a vector of samples from N(0,1). Note that if our GP has 10 dimensions (variables), then y contains 10 sample values.

3c. EXPECTED IMPROVEMENT

Note

these comments are copied in Python: interfaces/expected_improvement_interface.py

Then the improvement for this single sample is:

I = { best_known - min(y)   if (best_known - min(y) > 0)      (Equation 5)
    {          0               else

And the expected improvement, EI, can be computed by averaging repeated computations of I; i.e., monte-carlo integration. This is done in ExpectedImprovementEvaluator::ComputeExpectedImprovement(); we can also compute the gradient. This computation is needed in the optimization of q,p-EI.
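
A minimal monte-carlo sketch of Equations 4-5 is shown below (a hypothetical helper; the library’s ExpectedImprovementEvaluator additionally accounts for points_being_sampled, per-thread RNG state, and numerical safeguards). mus and chol_var are the posterior mean and the lower-triangular cholesky factor of the posterior variance at the q candidate points (column-major):

#include <algorithm>
#include <random>
#include <vector>

// Hypothetical monte-carlo EI estimate: repeatedly draw y = mus + L*w (Equation 4),
// compute the improvement over best_so_far (best_known in Equation 5), and average.
double MonteCarloExpectedImprovement(const std::vector<double>& mus, const std::vector<double>& chol_var,
                                     int num_to_sample, double best_so_far, int num_mc_iterations,
                                     std::mt19937_64* engine) {
  std::normal_distribution<double> standard_normal(0.0, 1.0);
  std::vector<double> w(num_to_sample);
  std::vector<double> y(num_to_sample);
  double aggregate = 0.0;
  for (int mc = 0; mc < num_mc_iterations; ++mc) {
    for (double& w_i : w) {
      w_i = standard_normal(*engine);
    }
    y = mus;  // Equation 4: y = mus + L*w, with L lower triangular (column-major)
    for (int j = 0; j < num_to_sample; ++j) {
      for (int i = j; i < num_to_sample; ++i) {
        y[i] += chol_var[j*num_to_sample + i] * w[j];
      }
    }
    const double improvement = best_so_far - *std::min_element(y.begin(), y.end());  // Equation 5
    if (improvement > 0.0) {
      aggregate += improvement;
    }
  }
  return aggregate / num_mc_iterations;
}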

There is also a special, analytic case of EI computation that does not require monte-carlo integration. This special case can only be used to compute 1,0-EI (and its gradient). Still this can be very useful (e.g., the heuristic optimization in gpp_heuristic_expected_improvement_optimization.hpp estimates q,0-EI by repeatedly solving 1,0-EI).

From there, since EI is taken from a sum of Gaussians, we expect it to be reasonably smooth, and we apply multistart, restarted gradient descent to find the optimum. The use of gradient descent implies the need for all of the various “grad” functions, e.g., GP::ComputeGradMeanOfPoints(). This is handled starting in the highest-level functions of this file, ComputeOptimalPointsToSample().

4. CODE DESIGN/LAYOUT OVERVIEW

Finally, we give some further details about how the previous ideas map into the code. We begin with an overview of important classes and functions in this file, and end by going over the call stack for the EI optimization entry point.

4a. First, the GaussianProcess (GP) class

The GaussianProcess class abstracts the handling of GPs and their properties; quickly going over the functionality: it provides methods for computing mean, variance, cholesky of variance, and their gradients (wrt spatial dimensions). GP also allows the user to sample function values from it, distributed according to the GP prior. Lastly GP provides the ability to change the hyperparameters of its covariance function (although currently you cannot change the covariance function; this would not be difficult to add).

Computation-wise, GaussianProcess also makes precomputation and preallocation convenient. The class tracks all of its inputs (e.g., X, f, noise var, covariance) as well as quantities that are derivable from only these inputs; e.g., K, the cholesky factorization of K, K^-1*y. Thus repeated calculations with the GP over the same training data avoid (very expensive) factorizations of K.

A last note about GP: it uses the State idiom laid out in gpp_common.hpp. The associated state is PointsToSampleState. PointsToSampleState tracks the current “test” data set, points_to_sample: the set of currently running experiments, possibly including the current point(s) being optimized. In the q,p-EI terminology, PointsToSampleState tracks the union of points_to_sample and points_being_sampled. PointsToSampleState preallocates all vectors needed by GP’s member functions; it also precomputes (per points_to_sample update) some derived quantities that are used repeatedly by GP member functions.

In current usage, users generally will not need to access GaussianProcess’s member functions directly; instead these are used indirectly when users compute or optimize EI. Plotting/visualization might be one reason to call GP members directly.

4b. Next, the ExpectedImprovementEvaluator and OnePotentialSampleExpectedImprovementEvaluator classes

ExpectedImprovementEvaluator abstracts the computation of EI and its gradient. This class references a single GaussianProcess that it uses to compute EI/grad EI as described above. Equations 4, 5 above detailed the EI computation; further details can be found below in the call tree discussion as well as in the implementation docs for these functions. The gradient of EI is implemented similarly; see implementation docs for details on the one subtlety.

OnePotentialSample is a special case of ExpectedImprovementEvaluator. With num_to_sample = 1 and num_being_sampled = 0 (this only occurs in 1,0-EI evaluation/optimization), there is only one experiment to worry about and no concurrent events. This simplifies the EI computation substantially (the multi-dimensional Gaussian becomes a simple one-dimensional case) and we can write EI analytically in terms of the PDF and CDF of a N(0,1) normal distribution (which are evaluated numerically by boost). No monte-carlo necessary!

ExpectedImprovementEvaluator and OnePotentialSample have corresponding State classes as well. These are similar to each other, except OnePotentialSample does not have a NormalRNG pointer (since it does no MC integration) and some temporaries are dropped since they have size 1. But for the general EI’s State, the NormalRNG pointer must reference a different object for each thread! Notably, both EI State classes construct their own GaussianProcess::StateType object for use with GP members. As long as there is only one EI state per thread, this ensures thread safety since there is never a reason (or a way) for multiple threads to accidentally use the same GP state. Finally, the EI state classes hold some pre-allocated vectors for use as local temporaries by EI and GradEI computation.

4c. And finally, we discuss selecting optimal experiments with ComputeOptimalPointsToSampleWithRandomStarts()

This function is the top of the hierarchy for EI optimization. It encompasses a multistart, restarted gradient descent method. Since this is not a convex optimization problem, there could be multiple local optima (or even none). So we start GD from multiple locations (multistart) as a heuristic in hopes of finding the global optimum.

See the file comments of gpp_optimization.hpp for more details on the base gradient descent implementation and the restart component of restarted gradient descent.

5. CODE HIERARCHY / CALL-TREE

For obtaining multiple new points to sample (q,p-EI), we have two main paths for optimization: multistart gradient descent and ‘dumb’ search. The optimization hierarchy looks like this (these optimization functions are in the header; they are templates):

ComputeOptimalPointsToSampleWithRandomStarts<...>(...) (selects random starting points; defined in gpp_math.hpp)

  • Solves q,p-EI.

  • Selects random starting locations based on random sampling from the domain (e.g., latin hypercube)

  • This calls:

    ComputeOptimalPointsToSampleViaMultistartGradientDescent<...>(...) (multistart gradient descent)

    • Switches into analytic OnePotentialSample case when appropriate

    • Multithreaded over starting locations

    • Optimizes with restarted gradient descent; collects results and updates the solution as new optima are found

    • This calls:

      MultistartOptimizer<...>::MultistartOptimize(...) for multistarting (see gpp_optimization.hpp) which in turn uses GradientDescentOptimizer::Optimize<ObjectiveFunctionEvaluator, Domain>() (see gpp_optimization.hpp)

ComputeOptimalPointsToSampleViaLatinHypercubeSearch<...>(...) (defined in gpp_math.hpp)

  • Estimates q,p-EI with a ‘dumb’ search.

  • Selects random starting locations based on random sampling from the domain (e.g., latin hypercube)

  • This calls:

    EvaluateEIAtPointList<...>(...)

    • Evaluates EI at each starting location

    • Switches into analytic OnePotentialSample case when appropriate

    • Multithreaded over starting locations

    • This calls:

      MultistartOptimizer<...>::MultistartOptimize(...) for multistarting (see gpp_optimization.hpp)

ComputeOptimalPointsToSample<...>(...) (defined in gpp_math.cpp)

  • Solves q,p-EI
  • Tries ComputeOptimalPointsToSampleWithRandomStarts() first.
  • If that fails, switches to ComputeOptimalPointsToSampleViaLatinHypercubeSearch().

So finally we will overview the function calls for EI calculation. We limit our discussion to the general MC case; the analytic case is similar and simpler.

ExpectedImprovementEvaluator::ComputeExpectedImprovement() (computes EI)

  • Computes GP.mean, GP.variance, cholesky(GP.variance)
  • MC integration: samples from the GP repeatedly (Equation 4) and computes the improvement (Equation 5), averaging the result. See function comments for more details.
  • Calls out to GP::ComputeMeanOfPoints(), GP::ComputeVarianceOfPoints(), ComputeCholeskyFactorL(), NormalRNG::operator(), and TriangularMatrixVectorMultiply()

ExpectedImprovementEvaluator::ComputeGradExpectedImprovement() (computes gradient of EI)

  • Compute GP.mean, variance, cholesky(variance), grad mean, grad variance, grad cholesky variance
  • MC integration: Equations 4, 5 as before to compute the improvement at each step. Only have grad EI contributions when improvement > 0. Care is needed because only the point yielding the largest improvement contributes to the gradient. See function comments for more details.

We will not detail the call tree once inside of GaussianProcess. The mathematical formulas for the mean and variance were already described above (Equations 2, 3). Function docs (in this file) further detail/cite the formulas and relevant derivations for gradients of these quantities. Suffice it to say there is a lot of linear algebra. Read those function docs for further details, but note that they do little to expose the important concepts behind EI and GP.

Defines

OL_CHOL_VAR(i, j)

OL_GRAD_CHOL(m, i, j)

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.

Functions

void EvaluateEIAtPointList(const GaussianProcess & gaussian_process, const ThreadSchedule & thread_schedule, double const *restrict initial_guesses, double const *restrict points_being_sampled, int num_multistarts, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, bool *restrict found_flag, NormalRNG * normal_rng, double *restrict function_values, double *restrict best_next_point)

Routes the EI computation through MultistartOptimizer + NullOptimizer to perform EI function evaluations at the list of input points, using the appropriate EI evaluator (e.g., monte carlo vs analytic) depending on inputs.

Function to evaluate Expected Improvement (q,p-EI) over a specified list of num_multistarts points. Optionally outputs the EI at each of these points. Outputs the point of the set obtaining the maximum EI value.

Generally gradient descent is preferred, but when it fails to converge this may be the only “robust” option. This function is also useful for plotting or debugging purposes (just to get a bunch of EI values).

This function is just a wrapper that builds the required state objects and a NullOptimizer object and calls MultistartOptimizer<...>::MultistartOptimize(...); see gpp_optimization.hpp.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_static), chunk_size (0).
initial_guesses[dim][num_to_sample][num_multistarts]:
 list of points at which to compute EI
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_multistarts:
 number of points to check
num_to_sample:number of potential future samples; gradients are evaluated wrt these points (i.e., the “q” in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the “p” in q,p-EI)
best_so_far:value of the best sample so far (must be min(points_sampled_value))
max_int_steps:maximum number of MC iterations
normal_rng[thread_schedule.max_num_threads]:
 a vector of NormalRNG objects that provide the (pseudo)random source for MC integration
Outputs:
found_flag[1]:true if best_next_point corresponds to a nonzero EI
normal_rng[thread_schedule.max_num_threads]:
 NormalRNG objects will have their state changed due to random draws
function_values[num_multistarts]:
 EI evaluated at each point of initial_guesses, in the same order as initial_guesses; never dereferenced if nullptr
best_next_point[dim][num_to_sample]:
 points yielding the best EI according to dumb search

template < typename DomainType >
void ComputeOptimalPointsToSample(const GaussianProcess & gaussian_process, const GradientDescentParameters & optimizer_parameters, const DomainType & domain, const ThreadSchedule & thread_schedule, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, bool lhc_search_only, int num_lhc_samples, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, NormalRNG * normal_rng, double *restrict best_points_to_sample)

This is a simple wrapper around ComputeOptimalPointsToSampleWithRandomStarts() and ComputeOptimalPointsToSampleViaLatinHypercubeSearch(). That is, this method attempts multistart gradient descent and falls back to latin hypercube search if gradient descent fails (or is not desired).

TODO(GH-77): Instead of random search, we may want to fall back on the methods in gpp_heuristic_expected_improvement_optimization.hpp if gradient descent fails; esp for larger q (even q \approx 4), latin hypercube search does a pretty terrible job. This is more for general q,p-EI as these two things are equivalent for 1,0-EI.

Solve the q,p-EI problem (see header docs) by optimizing the Expected Improvement. Uses multistart gradient descent, “dumb” search, and/or other heuristics to perform the optimization.

This is the primary entry-point for EI optimization in the optimal_learning library. It offers our best shot at improving robustness by combining higher accuracy methods like gradient descent with fail-safes like random/grid search.

Returns the optimal set of q points to sample CONCURRENTLY by solving the q,p-EI problem. That is, we may want to run 4 experiments at the same time and maximize the EI across all 4 experiments at once while knowing of 2 ongoing experiments (4,2-EI). This function handles this use case. Evaluation of q,p-EI (and its gradient) for q > 1 or p > 1 is expensive (requires monte-carlo iteration), so this method is usually very expensive.

Wraps ComputeOptimalPointsToSampleWithRandomStarts() and ComputeOptimalPointsToSampleViaLatinHypercubeSearch().

Compared to ComputeHeuristicPointsToSample() (gpp_heuristic_expected_improvement_optimization.hpp), this function makes no external assumptions about the underlying objective function. Instead, it utilizes a feature of the GaussianProcess that allows the GP to account for ongoing/incomplete experiments.

Note

These comments were copied into multistart_expected_improvement_optimization() in cpp_wrappers/expected_improvement.py.

Parameters:
gaussian_process:
 GaussianProcess object (holds points_sampled, values, noise_variance, derived quantities) that describes the underlying GP
optimizer_parameters:
 GradientDescentParameters object that describes the parameters controlling EI optimization (e.g., number of iterations, tolerances, learning rate)
domain:object specifying the domain to optimize over (see gpp_domain.hpp)
thread_schedule:
 struct instructing OpenMP on how to schedule threads; i.e., (suggestions in parens) max_num_threads (num cpu cores), schedule type (omp_sched_dynamic), chunk_size (0).
points_being_sampled[dim][num_being_sampled]:
 points that are being sampled in concurrent experiments
num_to_sample:how many simultaneous experiments you would like to run (i.e., the q in q,p-EI)
num_being_sampled:
 number of points being sampled concurrently (i.e., the p in q,p-EI)
best_so_far:value of the best sample so far (must be min(points_sampled_value))
max_int_steps:maximum number of MC iterations
lhc_search_only:
 whether to ONLY use latin hypercube search (and skip gradient descent EI opt)
num_lhc_samples:
 number of samples to draw if/when doing latin hypercube search
uniform_generator[1]:
 a UniformRandomGenerator object providing the random engine for uniform random numbers
normal_rng[thread_schedule.max_num_threads]:
 a vector of NormalRNG objects that provide the (pseudo)random source for MC integration
Outputs:
found_flag[1]:true if best_points_to_sample corresponds to a nonzero EI if sampled simultaneously
uniform_generator[1]:
 UniformRandomGenerator object will have its state changed due to random draws
normal_rng[thread_schedule.max_num_threads]:
 NormalRNG objects will have their state changed due to random draws
best_points_to_sample[num_to_sample*dim]:
 points yielding the best EI according to MGD

template void ComputeOptimalPointsToSample(const GaussianProcess & gaussian_process, const GradientDescentParameters & optimizer_parameters, const TensorProductDomain & domain, const ThreadSchedule & thread_schedule, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, bool lhc_search_only, int num_lhc_samples, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, NormalRNG * normal_rng, double *restrict best_points_to_sample)

template void ComputeOptimalPointsToSample(const GaussianProcess & gaussian_process, const GradientDescentParameters & optimizer_parameters, const SimplexIntersectTensorProductDomain & domain, const ThreadSchedule & thread_schedule, double const *restrict points_being_sampled, int num_to_sample, int num_being_sampled, double best_so_far, int max_int_steps, bool lhc_search_only, int num_lhc_samples, bool *restrict found_flag, UniformRandomGenerator * uniform_generator, NormalRNG * normal_rng, double *restrict best_points_to_sample)