moe.optimal_learning.python.cpp_wrappers package

Submodules

moe.optimal_learning.python.cpp_wrappers.covariance module

Thin covariance-related data containers that can be passed to cpp_wrappers.* functions/classes requiring covariance data.

C++ covariance objects currently do not expose their members to Python. Additionally, although C++ has several covariance functions available, runtime selection is not yet implemented. The containers here just track the hyperparameters of covariance functions in a format that can be interpreted by C++ calls.

class moe.optimal_learning.python.cpp_wrappers.covariance.SquareExponential(hyperparameters)[source]

Bases: moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface

Implement the square exponential covariance function.

The function: cov(x_1, x_2) = \alpha * \exp(-1/2 * (x_1 - x_2)^T * L * (x_1 - x_2)), where L is the diagonal matrix with i-th diagonal entry 1/lengths[i]^2

This covariance object has dim+1 hyperparameters: \alpha, lengths_i
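
For illustration, a minimal numpy sketch of this formula (not the C++ implementation; the function name is hypothetical):

    import numpy

    def square_exponential_covariance(point_one, point_two, alpha, lengths):
        # cov(x_1, x_2) = alpha * exp(-1/2 * (x_1 - x_2)^T L (x_1 - x_2)), L = diag(1/lengths^2)
        diff = numpy.asarray(point_one) - numpy.asarray(point_two)
        return alpha * numpy.exp(-0.5 * numpy.sum((diff / numpy.asarray(lengths)) ** 2))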

covariance(point_one, point_two)[source]

Compute the covariance function of two points, cov(point_one, point_two).

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.covariance_interface for interface specification.

covariance_type = 'square_exponential'
get_hyperparameters()[source]

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

get_json_serializable_info()[source]

Create and return a covariance_info dictionary of this covariance object.

grad_covariance(point_one, point_two)[source]

Compute the gradient of self.covariance(point_one, point_two) with respect to the FIRST argument, point_one.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.covariance_interface for interface specification.

hyperparameter_grad_covariance(point_one, point_two)[source]

Compute the gradient of self.covariance(point_one, point_two) with respect to its hyperparameters.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.covariance_interface for interface specification.

hyperparameter_hessian_covariance(point_one, point_two)[source]

Compute the hessian of self.covariance(point_one, point_two) with respect to its hyperparameters.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.covariance_interface for interface specification.

hyperparameters

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

static make_default_hyperparameters(dim)[source]

Return a default set of hyperparameters given the dimension of the space.

num_hyperparameters[source]

Return the number of hyperparameters of this covariance function.

set_hyperparameters(hyperparameters)[source]

Set hyperparameters to the specified hyperparameters; ordering must match.

moe.optimal_learning.python.cpp_wrappers.cpp_utils module

Utilities for making data C++ consumable and for making C++ outputs Python consumable.

moe.optimal_learning.python.cpp_wrappers.cpp_utils.cppify(array)[source]

Flatten a numpy array and copy it to a list for C++ consumption.

TODO(GH-159): This function will be unnecessary when C++ accepts numpy arrays.

Parameters:array (array-like (e.g., ndarray, list, etc.) of float64) – array to convert
Returns:list copied from flattened array
Return type:list
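
For example, the behavior is equivalent in spirit to the following sketch:

    import numpy

    def cppify(array):
        # flatten to 1-D in row-major order, then copy into a plain Python list
        return list(numpy.ravel(array))
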
moe.optimal_learning.python.cpp_wrappers.cpp_utils.cppify_hyperparameters(hyperparameters)[source]

Convert a flat array of hyperparameters into a form C++ can consume.

The C++ interface expects hyperparameters in a list, where:

  • hyperparameters[0] (float64): \alpha (\sigma_f^2, signal variance)
  • hyperparameters[1] (list): length scales (len = dim, one length per spatial dimension)

Parameters:hyperparameters (array of float64 with shape (num_hyperparameters)) – hyperparameters to convert
Returns:hyperparameters converted to C++ input format
Return type:list where item [0] is a float and item [1] is a list of float with len = dim
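
For example, for a SquareExponential covariance in dim = 2 (illustrative values):

    # flat hyperparameters: [alpha, length_0, length_1]
    hyperparameters = [2.0, 0.5, 1.5]
    # C++ input format: [alpha, [length_0, length_1]]
    cpp_form = [hyperparameters[0], list(hyperparameters[1:])]
    # cpp_form == [2.0, [0.5, 1.5]]
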
moe.optimal_learning.python.cpp_wrappers.cpp_utils.uncppify(array, expected_shape)[source]

Reshape a copy of the input array into the expected shape.

TODO(GH-159): If C++ returns numpy arrays, we can kill this function (instead, call reshape directly).

Parameters:
  • array (array-like) – array to reshape
  • expected_shape (int or tuple of ints) – desired shape for array
Returns:

reshaped input

Return type:

array of float64 with shape expected_shape
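
For example (a sketch; uncppify amounts to numpy.reshape on a copy):

    import numpy

    flat_output = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # e.g., a flat list returned from C++
    matrix = numpy.reshape(flat_output, (2, 3))   # recover the expected (2, 3) shape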

moe.optimal_learning.python.cpp_wrappers.domain module

Thin domain-related data containers that can be passed to cpp_wrappers.* functions/classes requiring domain data.

C++ domain objects currently do not expose their members to Python. So the classes in this file track the data necessary for C++ calls to construct the matching C++ domain object.

class moe.optimal_learning.python.cpp_wrappers.domain.SimplexIntersectTensorProductDomain(domain_bounds)[source]

Bases: moe.optimal_learning.python.interfaces.domain_interface.DomainInterface

Domain class for the intersection of the unit simplex with an arbitrary tensor product domain.

At the moment, this is just a dummy container for the domain boundaries, since the C++ object currently does not expose its internals to Python.

This object has a TensorProductDomain object as a data member and uses its functions when possible. See TensorProductDomain for what that means.

The unit d-simplex is defined as the set of x_i such that:

  1. x_i >= 0 \forall i  (i ranging over dimension)
  2. \sum_i x_i <= 1

(Implying that x_i <= 1 \forall i)

ASSUMPTION: most of the volume of the tensor product region lies inside the simplex region.
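
A numpy sketch of the membership test this domain represents (the actual check happens in C++; this helper is illustrative only):

    import numpy

    def point_in_simplex_intersect_tensor_product(point, domain_bounds):
        point = numpy.asarray(point)
        # tensor product check: every coordinate lies within its [min, max] interval
        inside_box = all(low <= x <= high for x, (low, high) in zip(point, domain_bounds))
        # unit simplex check: x_i >= 0 for all i, and \sum_i x_i <= 1
        inside_simplex = bool(numpy.all(point >= 0.0)) and float(numpy.sum(point)) <= 1.0
        return inside_box and inside_simplex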

check_point_inside(point)[source]

Check if a point is inside the domain/on its boundary or outside.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

compute_update_restricted_to_domain(max_relative_change, current_point, update_vector)[source]

Compute a new update so that CheckPointInside(current_point + new_update) is true.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

dim[source]

Return the number of spatial dimensions.

domain_bounds[source]

Return the [min, max] bounds for each spatial dimension.

domain_type = 'simplex_intersect_tensor_product'
generate_uniform_random_points_in_domain(num_points, random_source)[source]

Generate AT MOST num_points uniformly distributed points from the domain.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

class moe.optimal_learning.python.cpp_wrappers.domain.TensorProductDomain(domain_bounds)[source]

Bases: moe.optimal_learning.python.interfaces.domain_interface.DomainInterface

Domain type for a tensor product domain.

A d-dimensional tensor product domain is D = [x_0_{min}, x_0_{max}] X [x_1_{min}, x_1_{max}] X ... X [x_d_{min}, x_d_{max}]. At the moment, this is just a dummy container for the domain boundaries, since the C++ object currently does not expose its internals to Python.
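
A sketch of uniform sampling over such a domain (illustrative only; the wrapper defers the real work to C++):

    import numpy

    def generate_uniform_random_points(domain_bounds, num_points):
        # domain_bounds: list of (min, max) pairs, one per spatial dimension
        lower = numpy.array([low for low, high in domain_bounds])
        upper = numpy.array([high for low, high in domain_bounds])
        # each coordinate is sampled independently, uniformly on its own interval
        return numpy.random.uniform(lower, upper, size=(num_points, len(domain_bounds)))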

check_point_inside(point)[source]

Check if a point is inside the domain/on its boundary or outside.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

compute_update_restricted_to_domain(max_relative_change, current_point, update_vector)[source]

Compute a new update so that CheckPointInside(current_point + new_update) is true.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

dim[source]

Return the number of spatial dimensions.

domain_bounds[source]

Return the [min, max] bounds for each spatial dimension.

domain_type = 'tensor_product'
generate_random_point_in_domain(random_source=None)[source]

Generate a point uniformly at random such that self.check_point_inside(point) is True.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

generate_uniform_random_points_in_domain(num_points, random_source=None)[source]

Generate num_points uniformly distributed points from the domain.

We do not currently expose a C++ endpoint for this call; see moe.optimal_learning.python.interfaces.domain_interface for interface specification.

get_bounding_box()[source]

Return a list of ClosedIntervals representing a bounding box for this domain.

get_constraint_list()[source]

Return a list of lambda functions expressing the domain bounds as linear constraints. Used by COBYLA.
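
COBYLA treats a constraint as satisfied when its function evaluates to a non-negative value; the returned list plausibly has the following shape (an assumed sketch, not the exact closures this method builds):

    def build_constraint_list(domain_bounds):
        constraints = []
        for i, (lower, upper) in enumerate(domain_bounds):
            # bind i/lower/upper as default args so each lambda keeps its own values
            constraints.append(lambda x, i=i, lower=lower: x[i] - lower)  # x[i] >= lower
            constraints.append(lambda x, i=i, upper=upper: upper - x[i])  # x[i] <= upper
        return constraints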

get_json_serializable_info(minimal=False)[source]

Create and return a domain_info dictionary of this domain object.

Parameters:minimal (bool) – True for domain_type and dim only; False for all domain contents
Returns:dict representation of this domain
Return type:dict

moe.optimal_learning.python.cpp_wrappers.expected_improvement module

Tools to compute ExpectedImprovement and optimize the next best point(s) to sample using EI through C++ calls.

This file contains a class to compute Expected Improvement and its derivatives, plus functions to solve the q,p-EI optimization problem. The moe.optimal_learning.python.cpp_wrappers.expected_improvement.ExpectedImprovement class implements moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface. The optimization functions are convenient wrappers around the matching C++ calls.

See moe.optimal_learning.python.interfaces.expected_improvement_interface or gpp_math.hpp/cpp for further details on expected improvement.

class moe.optimal_learning.python.cpp_wrappers.expected_improvement.ExpectedImprovement(gaussian_process, points_to_sample=None, points_being_sampled=None, num_mc_iterations=10000, randomness=None)[source]

Bases: moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface, moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface

Implementation of Expected Improvement computation via C++ wrappers: EI and its gradient at specified point(s) sampled from a GaussianProcess.

A class to encapsulate the computation of expected improvement and its spatial gradient using points sampled from an associated GaussianProcess. The general EI computation requires monte-carlo integration; it can support q,p-EI optimization. It is designed to work with any GaussianProcess.

Note

Equivalent methods of ExpectedImprovementInterface and OptimizableInterface are aliased below (e.g., compute_expected_improvement and compute_objective_function, etc).

See moe.optimal_learning.python.interfaces.expected_improvement_interface docs for further details.

compute_expected_improvement(force_monte_carlo=False)[source]

Compute the expected improvement at points_to_sample, with points_being_sampled concurrent points being sampled.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+], where Xs are potential points to sample (union of points_to_sample and points_being_sampled) and X are already sampled points. The ^+ indicates that the expression in the expectation evaluates to 0 if it is negative. f^*(X) is the MINIMUM over all known function evaluations (points_sampled_value), whereas f(Xs) are GP-predicted function evaluations.

In words, we are computing the expected improvement (over the current best_so_far, best known objective function value) that would result from sampling (aka running new experiments) at points_to_sample with points_being_sampled concurrent/ongoing experiments.

In general, the EI expression is complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The idea of the MC approach is to repeatedly sample at the union of points_to_sample and points_being_sampled. This is analogous to gaussian_process_interface.sample_point_from_gp, but we sample num_union points at once: y = \mu + Lw where \mu is the GP-mean, L is the chol_factor(GP-variance) and w is a vector of num_union draws from N(0, 1). Then: improvement_per_step = max(max(best_so_far - y), 0.0) Observe that the inner max means only the smallest component of y contributes in each iteration. We compute the improvement over many random draws and average.

Parameters:force_monte_carlo (boolean) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
Returns:the expected improvement from sampling points_to_sample with points_being_sampled concurrent experiments
Return type:float64
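
A minimal monte-carlo sketch of this estimator (illustrative only; mu and chol_var stand in for the GP mean and cholesky-factored GP variance at the union of points):

    import numpy

    def monte_carlo_expected_improvement(mu, chol_var, best_so_far, num_mc_iterations=10000):
        total_improvement = 0.0
        for _ in range(num_mc_iterations):
            w = numpy.random.normal(size=mu.size)  # num_union draws from N(0, 1)
            y = mu + numpy.dot(chol_var, w)        # one joint GP sample: y = mu + L w
            # only the smallest component of y (the "winner") can contribute
            total_improvement += max(numpy.max(best_so_far - y), 0.0)
        return total_improvement / num_mc_iterations
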
compute_grad_expected_improvement(force_monte_carlo=False)[source]

Compute the gradient of expected improvement at points_to_sample wrt points_to_sample, with points_being_sampled concurrent samples.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

In general, the expressions for gradients of EI are complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The MC computation of grad EI is similar to the computation of EI (described in compute_expected_improvement). We differentiate y = \mu + Lw wrt points_to_sample; only terms from the gradient of \mu and L contribute. In EI, we computed: improvement_per_step = max(max(best_so_far - y), 0.0) and noted that only the smallest component of y may contribute (if it is > 0.0). Call this index winner. Thus in computing grad EI, we only add gradient terms that are attributable to the winner-th component of y.

Parameters:force_monte_carlo (boolean) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
Returns:gradient of EI, \pderiv{EI(Xq \cup Xp)}{Xq_{i,d}} where Xq is points_to_sample and Xp is points_being_sampled (grad EI from sampling points_to_sample with points_being_sampled concurrent experiments wrt each dimension of the points in points_to_sample)
Return type:array of float64 with shape (num_to_sample, dim)
compute_grad_objective_function(force_monte_carlo=False)

Compute the gradient of expected improvement at points_to_sample wrt points_to_sample, with points_being_sampled concurrent samples.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

In general, the expressions for gradients of EI are complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The MC computation of grad EI is similar to the computation of EI (described in compute_expected_improvement). We differentiate y = \mu + Lw wrt points_to_sample; only terms from the gradient of \mu and L contribute. In EI, we computed: improvement_per_step = max(max(best_so_far - y), 0.0) and noted that only the smallest component of y may contribute (if it is > 0.0). Call this index winner. Thus in computing grad EI, we only add gradient terms that are attributable to the winner-th component of y.

Parameters:force_monte_carlo (boolean) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
Returns:gradient of EI, \pderiv{EI(Xq \cup Xp)}{Xq_{i,d}} where Xq is points_to_sample and Xp is points_being_sampled (grad EI from sampling points_to_sample with points_being_sampled concurrent experiments wrt each dimension of the points in points_to_sample)
Return type:array of float64 with shape (num_to_sample, dim)
compute_hessian_objective_function(**kwargs)[source]

We do not currently support computation of the (spatial) hessian of Expected Improvement.

compute_objective_function(force_monte_carlo=False)

Compute the expected improvement at points_to_sample, with points_being_sampled concurrent points being sampled.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+], where Xs are potential points to sample (union of points_to_sample and points_being_sampled) and X are already sampled points. The ^+ indicates that the expression in the expectation evaluates to 0 if it is negative. f^*(X) is the MINIMUM over all known function evaluations (points_sampled_value), whereas f(Xs) are GP-predicted function evaluations.

In words, we are computing the expected improvement (over the current best_so_far, best known objective function value) that would result from sampling (aka running new experiments) at points_to_sample with points_being_sampled concurrent/ongoing experiments.

In general, the EI expression is complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The idea of the MC approach is to repeatedly sample at the union of points_to_sample and points_being_sampled. This is analogous to gaussian_process_interface.sample_point_from_gp, but we sample num_union points at once: y = \mu + Lw where \mu is the GP-mean, L is the chol_factor(GP-variance) and w is a vector of num_union draws from N(0, 1). Then: improvement_per_step = max(max(best_so_far - y), 0.0) Observe that the inner max means only the smallest component of y contributes in each iteration. We compute the improvement over many random draws and average.

Parameters:force_monte_carlo (boolean) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
Returns:the expected improvement from sampling points_to_sample with points_being_sampled concurrent experiments
Return type:float64
current_point

Get the current_point (array of float64 with shape (problem_size)) at which this object is evaluating the objective function, f(x).

dim[source]

Return the number of spatial dimensions.

evaluate_at_point_list(points_to_evaluate, randomness=None, max_num_threads=4, status=None)[source]

Evaluate Expected Improvement (1,p-EI) over a specified list of points_to_evaluate.

Note

We use points_to_evaluate instead of self._points_to_sample and compute the EI at those points only. self._points_to_sample is unchanged.

Generally gradient descent is preferred, but when it fails to converge this may be the only “robust” option. This function is also useful for plotting or debugging purposes (just to get a bunch of EI values).

Parameters:
  • points_to_evaluate (array of float64 with shape (num_to_evaluate, self.dim)) – points at which to compute EI
  • randomness (RandomnessSourceContainer (C++ object; e.g., from C_GP.RandomnessSourceContainer())) – RNGs used by C++ to generate initial guesses and as the source of normal random numbers when monte-carlo is used
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

EI evaluated at each of points_to_evaluate

Return type:

array of float64 with shape (points_to_evaluate.shape[0])

get_current_point()[source]

Get the current_point (array of float64 with shape (problem_size)) at which this object is evaluating the objective function, f(x).

num_being_sampled[source]

Number of points being sampled in concurrent experiments; i.e., the p in q,p-EI.

num_to_sample[source]

Number of points at which to compute/optimize EI, aka potential points to sample in future experiments; i.e., the q in q,p-EI.

problem_size[source]

Return the number of independent parameters to optimize.

set_current_point(points_to_sample)[source]

Set current_point to the specified point; ordering must match.

Parameters:points_to_sample (array of float64 with shape (problem_size)) – current_point at which to evaluate the objective function, f(x)
moe.optimal_learning.python.cpp_wrappers.expected_improvement.constant_liar_expected_improvement_optimization(ei_optimizer, num_multistarts, num_to_sample, lie_value, lie_noise_variance=0.0, randomness=None, max_num_threads=4, status=None)[source]

Heuristically solves q,0-EI using the Constant Liar policy; this wraps heuristic_expected_improvement_optimization().

Note that this optimizer only uses the analytic 1,0-EI, so it is fast.

See heuristic_expected_improvement_optimization() docs for general notes on how the heuristic optimization works. In this specific instance, we use the Constant Liar estimation policy.

Note

comments copied from ConstantLiarEstimationPolicy in gpp_heuristic_expected_improvement_optimization.hpp.

The “Constant Liar” objective function estimation policy is the simplest: it always returns the same value (Ginsbourger 2008). We call this the “lie”. This object also allows users to associate a noise variance with the lie value.

In Ginsbourger’s work, the most common lie values have been the min and max of all previously observed objective function values; i.e., min, max of GP.points_sampled_value. The mean has also been considered.

He also points out that larger lie values (e.g., max of prior measurements) will lead methods like ComputeEstimatedSetOfPointsToSample() to be more explorative and vice versa.
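
In outline, the heuristic behaves like the following sketch (optimize_one_point_ei is a hypothetical stand-in for the analytic 1,0-EI optimization; SamplePoint is as in moe.optimal_learning.python.data_containers):

    solution = []
    for _ in range(num_to_sample):
        # solve the fast, analytic 1,0-EI problem against the current model
        next_point = optimize_one_point_ei(gaussian_process)
        # the "lie": pretend lie_value was observed at next_point, then re-fit
        lie = SamplePoint(next_point, lie_value, lie_noise_variance)
        gaussian_process.add_sampled_points([lie])
        solution.append(next_point)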

Parameters:
  • ei_optimizer (cpp_wrappers.optimization.*Optimizer object) – object that optimizes (e.g., gradient descent, newton) EI over a domain
  • num_multistarts (int > 0) – number of times to multistart ei_optimizer (UNUSED, data is in ei_optimizer.optimizer_parameters)
  • num_to_sample (int >= 1) – how many simultaneous experiments you would like to run (i.e., the q in q,0-EI)
  • lie_value (float64) – the “constant lie” that this estimator should return
  • lie_noise_variance (float64) – the noise_variance to associate to the lie_value (MUST be >= 0.0)
  • randomness (RandomnessSourceContainer (C++ object; e.g., from C_GP.RandomnessSourceContainer())) – RNGs used by C++ to generate initial guesses
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

point(s) that approximately maximize the expected improvement (solving the q,0-EI problem)

Return type:

array of float64 with shape (num_to_sample, ei_optimizer.objective_function.dim)

moe.optimal_learning.python.cpp_wrappers.expected_improvement.kriging_believer_expected_improvement_optimization(ei_optimizer, num_multistarts, num_to_sample, std_deviation_coef=0.0, kriging_noise_variance=0.0, randomness=None, max_num_threads=4, status=None)[source]

Heuristically solves q,0-EI using the Kriging Believer policy; this wraps heuristic_expected_improvement_optimization().

Note that this optimizer only uses the analytic 1,0-EI, so it is fast.

See heuristic_expected_improvement_optimization() docs for general notes on how the heuristic optimization works. In this specific instance, we use the Kriging Believer estimation policy.

Note

comments copied from KrigingBelieverEstimationPolicy in gpp_heuristic_expected_improvement_optimization.hpp.

The “Kriging Believer” objective function estimation policy uses the Gaussian Process (i.e., the prior) to produce objective function estimates. The simplest method is to trust the GP completely: estimate = GP.mean(point). This follows the usage in Ginsbourger 2008. Users may also want the estimate to depend on the GP variance at the evaluation point, so that the estimate reflects how confident the GP is in the prediction. To that end, users may specify std_deviation_coef: estimate = GP.mean(point) + std_deviation_coef * GP.variance(point). Note that the coefficient is signed; analogously to Constant Liar, larger positive values are more explorative and larger negative values are more exploitive.

This object also allows users to associate a noise variance with each function value estimate.

Parameters:
  • ei_optimizer (cpp_wrappers.optimization.*Optimizer object) – object that optimizes (e.g., gradient descent, newton) EI over a domain
  • num_multistarts (int > 0) – number of times to multistart ei_optimizer (UNUSED, data is in ei_optimizer.optimizer_parameters)
  • num_to_sample (int >= 1) – how many simultaneous experiments you would like to run (i.e., the q in q,0-EI)
  • std_deviation_coef (float64) – the relative amount of bias (in units of GP std deviation) to introduce into the GP mean
  • kriging_noise_variance (float64) – the noise_variance to associate to each function value estimate (MUST be >= 0.0)
  • randomness (RandomnessSourceContainer (C++ object; e.g., from C_GP.RandomnessSourceContainer())) – RNGs used by C++ to generate initial guesses
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

point(s) that approximately maximize the expected improvement (solving the q,0-EI problem)

Return type:

array of float64 with shape (num_to_sample, ei_optimizer.objective_function.dim)

moe.optimal_learning.python.cpp_wrappers.expected_improvement.multistart_expected_improvement_optimization(ei_optimizer, num_multistarts, num_to_sample, use_gpu=False, which_gpu=0, randomness=None, max_num_threads=4, status=None)[source]

Solve the q,p-EI problem, returning the optimal set of q points to sample CONCURRENTLY in future experiments.

When points_being_sampled.size == 0 && num_to_sample == 1, this function will use (fast) analytic EI computations.

Note

The following comments are copied from gpp_math.hpp, ComputeOptimalPointsToSample(). These comments are copied into moe.optimal_learning.python.python_version.expected_improvement.multistart_expected_improvement_optimization()

This is the primary entry-point for EI optimization in the optimal_learning library. It offers our best shot at improving robustness by combining higher accuracy methods like gradient descent with fail-safes like random/grid search.

Returns the optimal set of q points to sample CONCURRENTLY by solving the q,p-EI problem. That is, we may want to run 4 experiments at the same time and maximize the EI across all 4 experiments at once while knowing of 2 ongoing experiments (4,2-EI). This function handles this use case. Evaluation of q,p-EI (and its gradient) for q > 1 or p > 1 is expensive (requires monte-carlo iteration), so this method is usually very expensive.

Compared to ComputeHeuristicPointsToSample() (gpp_heuristic_expected_improvement_optimization.hpp), this function makes no external assumptions about the underlying objective function. Instead, it utilizes a feature of the GaussianProcess that allows the GP to account for ongoing/incomplete experiments.

If num_to_sample = 1, this is the same as ComputeOptimalPointsToSampleWithRandomStarts().

The option of using a GPU to compute general q,p-EI via MC simulation is also available. To enable it, make sure you have installed the GPU components of MOE; otherwise, a runtime exception will be thrown.

Parameters:
  • ei_optimizer (cpp_wrappers.optimization.*Optimizer object) – object that optimizes (e.g., gradient descent, newton) EI over a domain
  • num_multistarts (int > 0) – number of times to multistart ei_optimizer (UNUSED, data is in ei_optimizer.optimizer_parameters)
  • num_to_sample (int >= 1) – how many simultaneous experiments you would like to run (i.e., the q in q,p-EI)
  • use_gpu (bool) – set to True if user wants to use GPU for MC simulation
  • which_gpu (int >= 0) – GPU device ID
  • randomness (RandomnessSourceContainer (C++ object; e.g., from C_GP.RandomnessSourceContainer())) – RNGs used by C++ to generate initial guesses and as the source of normal random numbers when monte-carlo is used
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

point(s) that maximize the expected improvement (solving the q,p-EI problem)

Return type:

array of float64 with shape (num_to_sample, ei_optimizer.objective_function.dim)

moe.optimal_learning.python.cpp_wrappers.gaussian_process module

Implementation of GaussianProcessInterface using C++ calls.

This file contains a class to manipulate a Gaussian Process through the C++ implementation (gpp_math.hpp/cpp).

See moe.optimal_learning.python.interfaces.gaussian_process_interface for more details.

class moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess(covariance_function, historical_data)[source]

Bases: moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessInterface

Implementation of a GaussianProcess via C++ wrappers: mean, variance, gradients thereof, and data I/O.

Object that encapsulates Gaussian Process Priors (GPPs). A GPP is defined by a set of (sample point, function value, noise variance) triples along with a covariance function that relates the points. Each point has dimension dim. These are the training data; for example, each sample point might specify an experimental cohort and the corresponding function value is the objective measured for that experiment. There is one noise variance value per function value; this is the measurement error and is treated as N(0, noise_variance) Gaussian noise.

GPPs estimate a real process f(x) ~ GP(m(x), k(x,x’)) (see file docs). This class deals with building an estimator to the actual process using measurements taken from the actual process, i.e., the (sample point, function value, noise) triples. Then predictions about unknown points can be made by sampling from the GPP, in particular by finding the (predicted) mean and variance. These functions (and their gradients) are provided by ComputeMeanOfPoints, ComputeVarianceOfPoints, etc.

Further mathematical details are given in the implementation comments, but we are essentially computing:

ComputeMeanOfPoints : K(Xs, X) * [K(X,X) + \sigma_n^2 I]^{-1} * y = Ks^T * K^{-1} * y
ComputeVarianceOfPoints: K(Xs, Xs) - K(Xs,X) * [K(X,X) + \sigma_n^2 I]^{-1} * K(X,Xs) = Kss - Ks^T * K^{-1} * Ks

This (estimated) mean and variance characterize the predicted distributions of the actual m(x), k(x,x’) functions that underlie our GP.

The “independent variables” for this object are points_to_sample. These points are both the “p” and the “q” in q,p-EI; i.e., they are the parameters of both ongoing experiments and new predictions. Recall that in q,p-EI, the q points are called points_to_sample and the p points are called points_being_sampled. Here, we need to make predictions about both point sets with the GP, so we simply call the union of point sets points_to_sample.

In GP computations, there is really no distinction between the “q” and “p” points from EI, points_to_sample and points_being_sampled, respectively. However, in EI optimization, we only need gradients of GP quantities wrt points_to_sample, so users should call member functions with num_derivatives = num_to_sample in that context.
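
A compact numpy/scipy sketch of the two formulas above (kernel is a hypothetical vectorized covariance; X, y, noise_variance are the training data and Xs the points_to_sample):

    import numpy
    from scipy import linalg

    def gp_posterior(kernel, X, y, noise_variance, Xs):
        K = kernel(X, X) + numpy.diag(noise_variance)   # K(X, X) + sigma_n^2 I
        Ks = kernel(X, Xs)                              # K(X, Xs)
        L = linalg.cholesky(K, lower=True)
        alpha = linalg.cho_solve((L, True), y)          # K^{-1} * y
        mean = numpy.dot(Ks.T, alpha)                   # Ks^T * K^{-1} * y
        v = linalg.solve_triangular(L, Ks, lower=True)  # L^{-1} * Ks
        var = kernel(Xs, Xs) - numpy.dot(v.T, v)        # Kss - Ks^T * K^{-1} * Ks
        return mean, var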

add_sampled_points(sampled_points)[source]

Add sampled point(s) (point, value, noise) to the GP’s prior data.

Also forces recomputation of all derived quantities for GP to remain consistent.

Parameters:sampled_points (list of SamplePoint objects (or SamplePoint-like iterables)) – moe.optimal_learning.python.SamplePoint objects to load into the GP (containing point, function value, and noise variance)
compute_cholesky_variance_of_points(points_to_sample)[source]

Compute the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample).

points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.

Parameters:points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
Returns:cholesky factorization of the variance matrix of this GP, lower triangular
Return type:array of float64 with shape (num_to_sample, num_to_sample), only lower triangle filled in
compute_grad_cholesky_variance_of_points(points_to_sample, num_derivatives=-1)[source]

Compute the gradient of the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample) wrt Xs.

points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.

This function accounts for the effect on the gradient resulting from cholesky-factoring the variance matrix. See Smith 1995 for algorithm details.

Note that grad_chol is nominally sized: grad_chol[num_to_sample][num_to_sample][num_to_sample][dim]. Let this be indexed grad_chol[k][j][i][d], read as the derivative of var[j][i] with respect to x_{k,d} (x = points_to_sample).

Parameters:
  • points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
  • num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns:

grad_chol: gradient of the cholesky factorization of the variance matrix of this GP. grad_chol[k][j][i][d] is the gradient of var_{j,i} with respect to x_{k,d}, the d-th dimension of the k-th entry of points_to_sample

Return type:

array of float64 with shape (num_derivatives, num_to_sample, num_to_sample, dim)

compute_grad_mean_of_points(points_to_sample, num_derivatives=-1)[source]

Compute the gradient of the mean of this GP at each point of Xs (points_to_sample) wrt Xs.

points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.

Note that grad_mu is nominally sized: grad_mu[num_to_sample][num_to_sample][dim]; entry grad_mu[j][i][d] is the d-th component of the derivative of the mean evaluated at the i-th input wrt the j-th input. However, for 0 <= i,j < num_to_sample, i != j, grad_mu[j][i][d] = 0. (See references or implementation for further details.) Thus, grad_mu is stored in a reduced form which only tracks the nonzero entries.

Parameters:
  • points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
  • num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns:

grad_mu: gradient of the mean of the GP. grad_mu[i][d] is actually the gradient of \mu_i wrt x_{i,d}, the d-th dim of the i-th entry of points_to_sample.

Return type:

array of float64 with shape (num_to_sample, dim)

compute_grad_variance_of_points(points_to_sample, num_derivatives=-1)[source]

Compute the gradient of the variance (matrix) of this GP at each point of Xs (points_to_sample) wrt Xs.

points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.

This function is similar to compute_grad_cholesky_variance_of_points(), except it does not include gradient terms from the cholesky factorization. The description will not be duplicated here.

Parameters:
  • points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
  • num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns:

grad_var: gradient of the variance matrix of this GP

Return type:

array of float64 with shape (num_derivatives, num_to_sample, num_to_sample, dim)

compute_mean_of_points(points_to_sample)[source]

Compute the mean of this GP at each point of Xs (points_to_sample).

points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.

Parameters:points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
Returns:mean: where mean[i] is the mean at points_to_sample[i]
Return type:array of float64 with shape (num_to_sample)
compute_variance_of_points(points_to_sample)[source]

Compute the variance (matrix) of this GP at each point of Xs (points_to_sample).

points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.

The variance matrix is symmetric although we currently return the full representation.

Parameters:points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
Returns:var_star: variance matrix of this GP
Return type:array of float64 with shape (num_to_sample, num_to_sample)
dim[source]

Return the number of spatial dimensions.

get_covariance_copy()[source]

Return a copy of the covariance object specifying the Gaussian Process.

Returns:covariance object encoding assumptions about the GP’s behavior on our data
Return type:interfaces.covariance_interface.CovarianceInterface subclass
get_historical_data_copy()[source]

Return the data (points, function values, noise) specifying the prior of the Gaussian Process.

Returns:object specifying the already-sampled points, the objective value at those points, and the noise variance associated with each observation
Return type:data_containers.HistoricalData
num_sampled[source]

Return the number of sampled points.

sample_point_from_gp(point_to_sample, noise_variance=0.0)[source]

Sample a function value from a Gaussian Process prior, provided a point at which to sample.

Uses the formula function_value = gpp_mean + sqrt(gpp_variance) * w1 + sqrt(noise_variance) * w2, where w1, w2 are draws from N(0,1).

Normal RNG source is held within the C++ GaussianProcess object.
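
In numpy terms, the draw looks like this (a sketch; the real draw happens in C++ with its own RNG, and the values below are illustrative):

    import numpy

    gpp_mean, gpp_variance, noise_variance = 1.0, 0.25, 0.01  # GP mean/variance at the point, plus noise
    w1, w2 = numpy.random.normal(size=2)                      # two independent draws from N(0, 1)
    function_value = gpp_mean + numpy.sqrt(gpp_variance) * w1 + numpy.sqrt(noise_variance) * w2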

Note

Set noise_variance to 0 if you want “accurate” draws from the GP. BUT if the drawn (point, value) pair is meant to be added back into the GP (e.g., for testing), then this point MUST be drawn with noise_variance equal to the noise associated with “point” as a member of “points_sampled”.

Parameters:
  • point_to_sample – point (in dim dimensions) at which to sample from this GP
  • noise_variance (float64 >= 0.0) – amount of noise to associate with the sample
Returns:

sample_value: function value drawn from this GP

Return type:

float64

moe.optimal_learning.python.cpp_wrappers.log_likelihood module

Tools to compute log likelihood-like measures of model fit and optimize them (wrt the hyperparameters of covariance).

See the file comments in moe.optimal_learning.python.interfaces.log_likelihood_interface for an overview of log likelihood-like metrics and their role in model selection. This file provides hooks to implementations of two such metrics in C++: Log Marginal Likelihood and Leave One Out Cross Validation Log Pseudo-Likelihood.

Note

This is a copy of the file comments in gpp_model_selection.hpp. These comments are copied in moe.optimal_learning.python.python_version.log_likelihood. See this file’s comments and interfaces.log_likelihood_interface for more details as well as the hpp and corresponding .cpp file.

a. LOG MARGINAL LIKELIHOOD (LML)

(Rasmussen & Williams, 5.4.1) The Log Marginal Likelihood measure comes from the ideas of Bayesian model selection, which use Bayesian inference to predict distributions over models and their parameters. The cpp file comments explore this idea in more depth. For now, we will simply state the relevant result. We can build up the notion of the “marginal likelihood”: probability(observed data GIVEN sampling points (X), model hyperparameters, model class (regression, GP, etc.)), which is denoted: p(y | X, \theta, H_i) (see the cpp file comments for more).

So the marginal likelihood deals with computing the probability that the observed data was generated from (another way: is easily explainable by) the given model.

The marginal likelihood is in part parameterized by the model’s hyperparameters; e.g., as mentioned above. Thus we can search for the set of hyperparameters that produces the best marginal likelihood and use them in our model. Additionally, a nice property of the marginal likelihood optimization is that it automatically trades off between model complexity and data fit, producing a model that is reasonably simple while still explaining the data reasonably well. See the cpp file comments for more discussion of how/why this works.

In general, we do not want a model with perfect fit and high complexity, since this implies overfit to input noise. We also do not want a model with very low complexity and poor data fit: here we are washing the signal out with (assumed) noise, so the model is simple but it provides no insight on the data.

This is not magic. Using GPs as an example, if the covariance function is completely mis-specified, we can blindly go through with marginal likelihood optimization, obtain an “optimal” set of hyperparameters, and proceed... never realizing that our fundamental assumptions are wrong. So care is always needed.

b. LEAVE ONE OUT CROSS VALIDATION (LOO-CV)

(Rasmussen & Williams, Chp 5.4.2) In cross validation, we split the training data, X, into two sets–a sub-training set and a validation set. Then we train a model on the sub-training set and test it on the validation set. Since the validation set comes from the original training data, we can compute the error. In effect we are examining how well the model explains itself.

Leave One Out CV works by considering n different validation sets, one at a time. Each point of X takes a turn being the sole member of the validation set. Then for each validation set, we compute a log pseudo-likelihood, measuring how probable that validation set is given the remaining training data and model hyperparameters.

Again, we can maximize this quantity over hyperparameters to help us choose the “right” set for the GP.
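
Schematically, LOO-CV computes the following (a sketch assuming a hypothetical fit_gp helper that trains a GP and exposes a predict method returning mean and variance):

    import numpy
    from scipy import stats

    def loo_cv_log_pseudo_likelihood(X, y, noise, fit_gp):
        total = 0.0
        for i in range(len(y)):
            keep = numpy.arange(len(y)) != i            # leave the i-th point out
            gp = fit_gp(X[keep], y[keep], noise[keep])  # train on the remaining points
            mean, var = gp.predict(X[i])
            # log probability of the held-out value under the predictive normal
            total += stats.norm.logpdf(y[i], loc=mean, scale=numpy.sqrt(var))
        return total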

class moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLeaveOneOutLogLikelihood(covariance_function, historical_data)[source]

Bases: moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogLikelihood

Class for computing the Leave-One-Out Cross Validation (LOO-CV) Log Pseudo-Likelihood.

Given a particular covariance function (including hyperparameters) and training data ((point, function value, measurement noise) tuples), the log LOO-CV pseudo-likelihood expresses how well the model explains itself.

Note

This is a copy of LeaveOneOutLogLikelihoodEvaluator’s class comments in gpp_model_selection.hpp. See this file’s comments and moe.optimal_learning.python.interfaces.log_likelihood_interface for more details as well as the hpp and corresponding .cpp file.

That is, cross validation involves splitting the training set into a sub-training set and a validation set. Then we measure the log likelihood that a model built on the sub-training set could produce the values in the validation set.

Leave-One-Out CV does this process |y| times: on the i-th run, the sub-training set is (X, y) with the i-th point removed and the validation set is the i-th point. Then the predictive performance of each sub-model is aggregated into a pseudo-likelihood.

This quantity primarily deals with the internal consistency of the model–how well it explains itself. The LOO-CV likelihood gives an “estimate for the predictive probability, whether or not the assumptions of the model may be fulfilled.” It is a more frequentist view of model selection. (Rasmussen & Williams p118) See Rasmussen & Williams 5.3 and 5.4.2 for more details.

As with the log marginal likelihood, we can use this quantity to measure the performance of our model. We can also maximize it (via hyperparameter modifications or covariance function changes) to improve model performance. It has also been argued that LOO-CV is better at detecting model mis-specification (e.g., wrong covariance function) than log marginal measures (Rasmussen & Williams p118).

compute_hessian_log_likelihood(hyperparameters)[source]

The (hyperparameter) hessian of LOO-CV has not been implemented in C++ yet.

class moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogLikelihood(covariance_function, historical_data, log_likelihood_type=moe.build.GPP.LogLikelihoodTypes.log_marginal_likelihood)[source]

Bases: moe.optimal_learning.python.interfaces.log_likelihood_interface.GaussianProcessLogLikelihoodInterface, moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface

Class for computing log likelihood-like measures of model fit via C++ wrappers (currently log marginal and leave one out cross validation).

See moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogMarginalLikelihood and moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLeaveOneOutLogLikelihood classes below for some more details on these metrics. Users may find it more convenient to construct these objects instead of a GaussianProcessLogLikelihood object directly. Since these various metrics are fairly different, the member function docs in this class will remain generic.

See gpp_model_selection.hpp/cpp for further overview and in-depth discussion, respectively.

compute_grad_log_likelihood()[source]

Compute the gradient (wrt hyperparameters) of the objective_type measure at the specified hyperparameters.

Returns:grad_log_likelihood: i-th entry is \pderiv{LL(y | X, \theta)}{\theta_i}
Return type:array of float64 with shape (num_hyperparameters)
compute_grad_objective_function()

Compute the gradient (wrt hyperparameters) of the objective_type measure at the specified hyperparameters.

Returns:grad_log_likelihood: i-th entry is \pderiv{LL(y | X, \theta)}{\theta_i}
Return type:array of float64 with shape (num_hyperparameters)
compute_hessian_log_likelihood()[source]

We do not currently support computation of the (hyperparameter) hessian of log likelihood-like metrics.

compute_hessian_objective_function()

We do not currently support computation of the (hyperparameter) hessian of log likelihood-like metrics.

compute_log_likelihood()[source]

Compute the objective_type measure at the specified hyperparameters.

Returns:value of log_likelihood evaluated at hyperparameters (LL(y | X, \theta))
Return type:float64
compute_objective_function()

Compute the objective_type measure at the specified hyperparameters.

Returns:value of log_likelihood evaluated at hyperparameters (LL(y | X, \theta))
Return type:float64
current_point

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

Equivalently, get the current_point at which this object is evaluating the objective function, f(x)

dim[source]

Return the number of spatial dimensions.

get_covariance_copy()[source]

Return a copy of the covariance object specifying the Gaussian Process.

Returns:covariance object encoding assumptions about the GP’s behavior on our data
Return type:interfaces.covariance_interface.CovarianceInterface subclass
get_historical_data_copy()[source]

Return the data (points, function values, noise) specifying the prior of the Gaussian Process.

Returns:object specifying the already-sampled points, the objective value at those points, and the noise variance associated with each observation
Return type:data_containers.HistoricalData
get_hyperparameters()[source]

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

Equivalently, get the current_point at which this object is evaluating the objective function, f(x)

hyperparameters

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

Equivalently, get the current_point at which this object is evaluating the objective function, f(x)

num_hyperparameters[source]

Return the number of hyperparameters aka the number of independent parameters to optimize.

problem_size

An alias for num_hyperparameters to fulfill moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface.

set_hyperparameters(hyperparameters)[source]

Set hyperparameters to the specified hyperparameters; ordering must match.

Parameters:hyperparameters (array of float64 with shape (num_hyperparameters)) – hyperparameters at which to evaluate the log likelihood (objective function), f(x)
class moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogMarginalLikelihood(covariance_function, historical_data)[source]

Bases: moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogLikelihood

Class for computing the Log Marginal Likelihood, log(p(y | X, \theta)).

That is, the probability of observing the training values, y, given the training points, X, and hyperparameters (of the covariance function), \theta.

This is a measure of how likely it is that the observed values came from our Gaussian Process Prior.

Note

This is a copy of LogMarginalLikelihoodEvaluator’s class comments in gpp_model_selection.hpp. See this file’s comments and moe.optimal_learning.python.interfaces.log_likelihood_interface for more details as well as the hpp and corresponding .cpp file.

Given a particular covariance function (including hyperparameters) and training data ((point, function value, measurement noise) tuples), the log marginal likelihood is the log of the probability that a Gaussian Process would have generated the observed function values at the given measurement points. So the log marginal likelihood tells us “the probability of the observations given the assumptions of the model.” Log marginal sits well with the Bayesian Inference camp. (Rasmussen & Williams p118)

This quantity primarily deals with the trade-off between model fit and model complexity. Handling this trade-off is automatic in the log marginal likelihood calculation. See Rasmussen & Williams 5.2 and 5.4.1 for more details.

We can use the log marginal likelihood to determine how good our model is. Additionally, we can maximize it by varying hyperparameters (or even changing covariance functions) to improve our model quality. Hence this class provides access to functions for computing log marginal likelihood and its hyperparameter gradients.
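
Concretely, the quantity is log p(y | X, \theta) = -1/2 * y^T K^{-1} y - 1/2 * log|K| - n/2 * log(2\pi) (Rasmussen & Williams, Eq. 5.8, assuming a zero-mean prior); a numpy sketch:

    import numpy
    from scipy import linalg

    def log_marginal_likelihood(K, y):
        # K: covariance of the training points (noise included); y: observed values
        L = linalg.cholesky(K, lower=True)
        alpha = linalg.cho_solve((L, True), y)         # K^{-1} * y
        return (-0.5 * numpy.dot(y, alpha)
                - numpy.sum(numpy.log(numpy.diag(L)))  # equals -1/2 * log|K|
                - 0.5 * len(y) * numpy.log(2.0 * numpy.pi))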

moe.optimal_learning.python.cpp_wrappers.log_likelihood.evaluate_log_likelihood_at_hyperparameter_list(log_likelihood_evaluator, hyperparameters_to_evaluate, max_num_threads=4, status=None)[source]

Compute the specified log likelihood measure at each input set of hyperparameters.

Generally Newton or gradient descent is preferred but when they fail to converge this may be the only “robust” option. This function is also useful for plotting or debugging purposes (just to get a bunch of log likelihood values).

Calls into evaluate_log_likelihood_at_hyperparameter_list() in cpp/GPP_python_model_selection.cpp.

Parameters:
  • log_likelihood_evaluator (cpp_wrappers.log_likelihood.LogLikelihood) – object specifying which log likelihood measure to evaluate
  • hyperparameters_to_evaluate (array of float64 with shape (num_to_eval, log_likelihood_evaluator.num_hyperparameters)) – the hyperparameters at which to compute the specified log likelihood
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

log likelihood value at each specified set of hyperparameters

Return type:

array of float64 with shape (hyperparameters_to_evaluate.shape[0])

moe.optimal_learning.python.cpp_wrappers.log_likelihood.multistart_hyperparameter_optimization(log_likelihood_optimizer, num_multistarts, randomness=None, max_num_threads=4, status=None)[source]

Select the hyperparameters that maximize the specified log likelihood measure of model fit (over the historical data) within the specified domain.

See moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogMarginalLikelihood and moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLeaveOneOutLogLikelihood for an overview of some example log likelihood-like measures.

Optimizers are: null (‘dumb’ search), gradient descent, and Newton. Newton is the suggested optimizer.

‘dumb’ search means this will just evaluate the objective log likelihood measure at num_multistarts ‘points’ (hyperparameters) in the domain, uniformly sampled using latin hypercube sampling. The hyperparameter_optimizer_parameters input specifies the desired optimization technique as well as parameters controlling its behavior (see moe.optimal_learning.python.cpp_wrappers.optimization).

See gpp_python_common.cpp for C++ enum declarations laying out the options for objective and optimizer types.

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline for sizing the domain and gd_parameters.num_multistarts; i.e., be wary of sets of initial guesses that cover the space too sparsely.

Note that the domain here must be specified in LOG-10 SPACE!

The solution is guaranteed to lie within the region specified by “domain”; note that it may not be a true optimum (i.e., the gradient may be substantially nonzero).
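
For example, to let each hyperparameter range over [0.01, 100.0], the domain passed in would look like this (illustrative values):

    num_hyperparameters = 3                                    # e.g., alpha plus 2 length scales
    log10_domain_bounds = [(-2.0, 2.0)] * num_hyperparameters  # each maps to [10**-2, 10**2]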

Warning

this function fails if NO improvement can be found! In that case, the output will always be the first randomly chosen point. status will report failure.

Parameters:
  • log_likelihood_optimizer (cpp_wrappers.optimization.*Optimizer object) – object that optimizes (e.g., gradient descent, newton) log likelihood over a domain
  • num_multistarts (int > 0) – number of times to multistart log_likelihood_optimizer (UNUSED, data is in log_likelihood_optimizer.optimizer_parameters)
  • randomness (RandomnessSourceContainer (C++ object; e.g., from C_GP.RandomnessSourceContainer())) – RNGs used by C++ to generate initial guesses
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

hyperparameters that maximize the specified log likelihood measure within the specified domain

Return type:

array of float64 with shape (log_likelihood_optimizer.objective_function.num_hyperparameters)

moe.optimal_learning.python.cpp_wrappers.optimization module

Thin optimization-related containers that can be passed to cpp_wrappers.* functions/classes that perform optimization.

See cpp/gpp_optimization.hpp for more details on optimization techniques.

C++ optimization tools are templated, so it doesn’t make much sense to expose their members to Python (unless we built a C++ optimizable object that could call Python). So the classes in this file track the data necessary for C++ calls to construct the matching C++ optimization object and the appropriate optimizer parameters.

C++ expects input objects to have a certain format; the classes in this file make it convenient to put data into the expected format. Generally the C++ optimizers want to know the objective function (what), the optimization method (how), and the domain (where), along with parameters like the number of iterations, tolerances, etc.

These Python classes/functions wrap the C++ structs in: gpp_optimizer_parameters.hpp.

The *OptimizerParameters structs contain the high-level details (what to optimize, how to do it, etc.) explicitly, and they hold a reference to a C++ struct containing parameters for the specific optimizer. The build_*_parameters() helper functions provide wrappers around these C++ objects’ constructors.

Note

the following comments in this module are copied from the header comments in gpp_optimization.hpp.

Table of Contents:

  1. FILE OVERVIEW
  2. OPTIMIZATION OF OBJECTIVE FUNCTIONS
    1. GRADIENT DESCENT
      1. OVERVIEW
      2. IMPLEMENTATION DETAILS
    2. NEWTON’S METHOD
      1. OVERVIEW
      2. IMPLEMENTATION DETAILS
    3. MULTISTART OPTIMIZATION

Read the “OVERVIEW” sections for header-style comments that describe the file contents at a high level. Read the “IMPLEMENTATION” comments for cpp-style comments that talk more about the specifics. Both types are included together here since this file contains template class declarations and template function definitions. For further implementation details, see comment blocks before each individual class/function.

1 FILE OVERVIEW

First, the functions in this file are all MAXIMIZERS. We also use the term “optima,” and unless we specifically state otherwise, “optima” and “optimization” refer to “maxima” and “maximization,” respectively. (Note that minimizing g(x) is equivalent to maximizing f(x) = -1 * g(x).)

This file contains templates for some common optimization techniques: gradient descent (GD) and Newton’s method. We provide constrained implementations (constraint via heuristics like restricting updates to 50% of the distance to the nearest wall) of these optimizers. For unconstrained, just set the domain to be huge: [-DBL_MAX, DBL_MAX].

We provide *Optimizer template classes (e.g., NewtonOptimizer) as main endpoints for doing local optimization (i.e., run the optimization method from a single initial guess). We also provide a MultistartOptimizer class for global optimization (i.e., start optimizers from each of a set of initial guesses). These are all discussed further below. These templated classes are general and can optimize any OptimizableInterface subclass.

In this way, we can make the local and global optimizers completely agnostic to the function they are optimizing.

2 OPTIMIZATION OF OBJECTIVE FUNCTIONS

2a. GRADIENT DESCENT (GD)

2a, i. OVERVIEW

We use first derivative information to walk the path of steepest ascent, hopefully toward a (local) maximum of the chosen log likelihood measure. This is implemented in: GradientDescentOptimization(). This method ensures that the result lies within a specified domain.

We additionally restart gradient-descent in practice; i.e., we repeatedly take the output of a GD run and start a new GD run from that point. This lives in: GradientDescentOptimizer::Optimize().

Even with restarts, gradient descent (GD) cannot start “too far” from the solution and still successfully find it. Thus users should typically start it from multiple initial guesses and take the best one (see gpp_math and gpp_model_selection for examples). The MultistartOptimizer template class in this file provides generic multistart functionality.

Finally, we optionally apply Polyak-Ruppert averaging. This is described in more detail in the docstring for GradientDescentParameters. For functions where we only have access to gradient + noise, this averaging can lead to better answers than straight gradient descent. It amounts to averaging over the final N_{avg} steps.
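Schematically, Polyak-Ruppert averaging just reports the mean of the final N_{avg} iterates instead of the last one. A toy sketch (the function name is illustrative):

    def polyak_ruppert_average(iterates, num_steps_averaged):
        """Return the mean of the final ``num_steps_averaged`` iterates
        (floats here for simplicity). Illustrative only.
        """
        tail = iterates[-num_steps_averaged:]
        return sum(tail) / len(tail)

    # Average the last 3 iterates of a noisy run:
    print(polyak_ruppert_average([3.2, 2.1, 2.0, 1.9, 2.05], 3))  # 1.983...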

Gradient descent is implemented in: GradientDescentOptimizer::Optimize() (which calls GradientDescentOptimization())

2a, ii. IMPLEMENTATION DETAILS

GD’s update is: \theta_{i+1} = \theta_i + \gamma * \nabla f(\theta_i) where \gamma controls the step-size and is chosen heuristically, often varying by problem.

The previous update leads to unconstrained optimization. To ensure that our results always stay within the specified domain, we additionally limit updates if they would move us outside the domain. For example, we could imagine only moving half the distance to the nearest boundary.

With gradient descent (GD), it is hard to know what step sizes to take. Unfortunately, far enough away from an optimum, the objective could increase (but very slowly). If gradient descent takes too large of a step in a bad direction, it can easily “get lost.” At the same time, taking very small steps leads to slow performance. To help, we take the standard approach of scaling down the step size with the iteration number. We also allow the user to specify a maximum relative change to limit the aggressiveness of GD steps. Finally, we wrap GD in a restart loop, firing off another GD run from the current location unless convergence has been reached.
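Putting those pieces together, here is a toy 1D gradient ascent with an iteration-scaled step size and a restart wrapper, in the spirit of GradientDescentOptimization() and GradientDescentOptimizer::Optimize(). The pre_mult/gamma step-size schedule is an assumption modeled on the C++ parameter names; this is a sketch, not the library’s implementation:

    def gradient_ascent(grad_f, x0, pre_mult=1.0, gamma=0.7,
                        max_num_steps=100, tolerance=1.0e-8):
        """Toy 1D gradient ascent with step size pre_mult / (i + 1)**gamma."""
        x = x0
        for i in range(max_num_steps):
            step = pre_mult / (i + 1.0) ** gamma * grad_f(x)
            x += step
            if abs(step) < tolerance:
                break
        return x

    def restarted_gradient_ascent(grad_f, x0, max_num_restarts=10, **kwargs):
        """Re-fire gradient ascent from wherever the previous run stopped."""
        x = x0
        for _ in range(max_num_restarts):
            x_next = gradient_ascent(grad_f, x, **kwargs)
            if abs(x_next - x) < kwargs.get('tolerance', 1.0e-8):
                return x_next
            x = x_next
        return x

    # Maximize f(x) = -(x - 2)**2; its gradient is -2 * (x - 2).
    print(restarted_gradient_ascent(lambda x: -2.0 * (x - 2.0), x0=0.0))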

2b. NEWTON’S METHOD

2b, i. OVERVIEW

Newton’s Method (for optimization) uses second derivative information in addition to the first derivatives used by gradient descent (GD). In higher dimensions, first derivatives => gradients and second derivatives => the Hessian matrix. At each iteration, gradient descent computes the derivative and blindly takes a step (of some heuristically determined size) in that direction. Care must be taken in the step size choice to balance robustness and speed while ensuring that convergence is possible. By using second derivative information (the Hessian matrix in higher dimensions), which is interpretable as information about local curvature, Newton makes better* choices about step size and direction to ensure rapid** convergence.

*, ** See “IMPLEMENTATION DETAILS” comments section for details.

Recall that Newton indiscriminately finds solutions where f'(x) = 0; the eigenvalues of the Hessian classify these x as optima, saddle points, or indeterminate. We multistart Newton (e.g., gpp_model_selection) but just take the best objective value without classifying solutions. The MultistartOptimizer template class in this file provides generic multistart functionality.

Newton is implemented here: NewtonOptimizer::Optimize() (which calls NewtonOptimization())

2b, ii. IMPLEMENTATION DETAILS

Let’s address the footnotes from the previous section (Section 2b, i paragraph 1):

* Within its region of attraction, Newton’s steps are optimal (when we have only second derivative information). Outside of this region, Newton can make very poor decisions and diverge. In general, Newton is more sensitive to its initial conditions than gradient descent, but it has the potential to be much, much faster.

** By quadratic convergence, we mean that once Newton is near enough to the solution, the error is roughly squared each iteration (equivalently, the magnitude of the log of the error roughly doubles). Numerically, we would see the “number of correct digits” double each iteration. Again, this only happens once Newton is “close enough.”

Newton’s Method is a root-finding technique at its base. To find a root of g(x), Newton requires an initial guess, x_0, and the ability to compute g(x) and g'(x). The idea is to compute the root of the line tangent to g at x_0; call this x_1. Then repeat. That is, the core idea is to make repeated linear approximations to g(x) and proceed in a fixed-point-like fashion.

As an optimization method, we are looking for roots of the gradient, f'(x_{opt}) = 0. So we require an initial guess x_0 and the ability to evaluate f'(x) and f''(x) (in higher dimensions, the gradient and Hessian of f). Thus Newton makes repeated linear approximations to f'(x); equivalently, it locally approximates f(x) with a quadratic function and continues iterating from the optimum of that quadratic. In particular, Newton solves a quadratic program in exactly one iteration.

Mathematically, the update formulas for gradient descent (GD) and Newton are:

GD:     \theta_{i+1} = \theta_i + \gamma * \nabla f(\theta_i)

Newton: \theta_{i+1} = \theta_i - H_f^{-1}(\theta_i) * \nabla f(\theta_i)

Note: the sign of the update is flipped because H is negative definite near a maximum. These update schemes are similar. In GD, \gamma is chosen heuristically. There are many ways to proceed, but only so much can be done with just gradient information; moreover, the standard algorithm always proceeds in the direction of the gradient. Newton takes a much more general approach. Instead of a scalar \gamma, the Newton update applies H^{-1} to the gradient, changing both the direction and magnitude of the step.
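As a concrete illustration of the Newton update (and of the claim above that Newton solves a quadratic objective in a single iteration), here is a toy numpy sketch; this is not the C++ NewtonOptimization() routine:

    import numpy as np

    def newton_step(grad, hess, theta):
        """One Newton update: theta - H^-1 * grad (H is negative definite near a maximum)."""
        return theta - np.linalg.solve(hess(theta), grad(theta))

    # Concave quadratic f(theta) = -0.5 * theta^T A theta + b^T theta, with A SPD.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    grad = lambda theta: -A.dot(theta) + b   # \nabla f
    hess = lambda theta: -A                  # H_f (constant, negative definite)

    theta = np.array([10.0, -7.0])           # arbitrary initial guess
    theta = newton_step(grad, hess, theta)   # lands exactly on the maximizer
    print(np.allclose(A.dot(theta), b))      # True: theta solves A theta = b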

Unfortunately, Newton indiscriminately finds solutions where f'(x) = 0. This is not necessarily an optimum! In one dimension, we can have f'(x) = 0 and f''(x) = 0, in which case the solution need not be an optimum (e.g., y = x^3 at x = 0). In higher dimensions, a saddle point can also result (e.g., z = x^2 - y^2 at x,y = 0). More generally, we have an optimum if the Hessian is strictly negative or positive definite, a saddle point if the Hessian has both positive and negative eigenvalues, and an indeterminate case if the Hessian is singular.
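That classification is easy to check numerically. A small illustrative helper (assuming numpy; not part of this package):

    import numpy as np

    def classify_critical_point(hessian, tol=1.0e-10):
        """Classify a critical point (where grad f = 0) by the Hessian's eigenvalues."""
        eigs = np.linalg.eigvalsh(hessian)   # Hessians are symmetric
        if np.any(np.abs(eigs) < tol):
            return 'indeterminate'           # singular Hessian
        if np.all(eigs < 0):
            return 'maximum'
        if np.all(eigs > 0):
            return 'minimum'
        return 'saddle point'

    # z = x^2 - y^2 at the origin: Hessian is diag(2, -2) => saddle point.
    print(classify_critical_point(np.diag([2.0, -2.0])))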

2c. MULTISTART OPTIMIZATION

Above, we mentioned that gradient descent (GD), Newton, etc. have a difficult time converging if they are started “too far” from an optimum. Even if convergence occurs, it will typically be very slow unless the problem is simple. Worse, in a problem with multiple optima, the methods may converge to the wrong one!

Multistarting the optimizers is one way of mitigating* this issue. Multistart involves starting a run of the specified optimizer (e.g., Newton) from each of a set of initial guesses. Then the best result is reported as the result of the whole procedure. By trying a large number of initial guesses, we potentially reduce the need for good guesses; i.e., hopefully at least one guess will be “near enough” to the global optimum. This functionality is provided in MultistartOptimizer::MultistartOptimize(...).

* As noted below in the MultistartOptimizer::MultistartOptimize() function docs, mitigate is intentional here. Multistarting is NOT GUARANTEED to find global optima. But it can increase the chances of success.

Currently we let the user specify the initial guesses. In practice, this typically means a random sampling of points. We do not (yet) make any effort to, say, sample more heavily from regions where “more stuff is happening,” nor do we apply any other such heuristics.

TODO(GH-165): Improve multistart heuristics.

Finally, MultistartOptimizer::MultistartOptimize() is also used to provide ‘dumb’ search functionality (optimization by just evaluating the objective at numerous points). For sufficiently complex problems, gradient descent, Newton, etc. can have exceptionally poor convergence characteristics or run too slowly. In cases where these more advanced techniques fail, we commonly fall back to ‘dumb’ search.
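Schematically, multistart optimization (and, with a do-nothing local step, “dumb” search) reduces to the loop below. This is a toy outline in the spirit of MultistartOptimizer::MultistartOptimize(...), not its actual implementation:

    def multistart_maximize(f, local_optimizer, initial_guesses):
        """Run ``local_optimizer`` from every guess; report the best endpoint found.

        With ``local_optimizer = lambda f_, x: x`` (a null optimizer), this
        degenerates to 'dumb' search: evaluate ``f`` at each guess, keep the best.
        """
        best_point = None
        best_value = float('-inf')
        for guess in initial_guesses:
            candidate = local_optimizer(f, guess)
            value = f(candidate)
            if value > best_value:
                best_point, best_value = candidate, value
        return best_point, best_value

    # 'Dumb' search over a grid for f(x) = -(x - 2)**2:
    f = lambda x: -(x - 2.0) ** 2
    print(multistart_maximize(f, lambda f_, x: x, [i * 0.5 for i in range(10)]))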

class moe.optimal_learning.python.cpp_wrappers.optimization.GradientDescentOptimizer(domain, optimizable, optimizer_parameters, num_random_samples=None)[source]

Bases: moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface

Simple container for telling C++ to use Gradient Descent for optimization.

See this module’s docstring for some more information or the comments in gpp_optimization.hpp for full details on GD.

optimize(**kwargs)[source]

C++ does not expose this endpoint.

class moe.optimal_learning.python.cpp_wrappers.optimization.GradientDescentParameters(*args, **kwargs)[source]

Bases: moe.build.GPP.GradientDescentParameters, moe.optimal_learning.python.comparison.EqualityComparisonMixin

Container to hold parameters that specify the behavior of Gradient Descent in a C++-readable form.

See __init__() docstring for more information.
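For reference, construction looks roughly like the sketch below. The keyword names mirror the fields of the C++ GradientDescentParameters struct in gpp_optimizer_parameters.hpp; whether the Python wrapper accepts them as keywords (and the exact field list) should be checked against the __init__() docstring. This is an assumption, and the values shown are arbitrary placeholders, not recommended settings:

    from moe.optimal_learning.python.cpp_wrappers.optimization import GradientDescentParameters

    # Field names follow the C++ struct; if the wrapper only accepts positional
    # arguments, pass the values in this order instead.
    gd_parameters = GradientDescentParameters(
        num_multistarts=100,      # number of initial guesses for multistart
        max_num_steps=500,        # GD iterations per restart
        max_num_restarts=4,       # restarts of the GD loop
        num_steps_averaged=10,    # Polyak-Ruppert averaging window
        gamma=0.7,                # exponent of the step-size decay
        pre_mult=1.0,             # scale factor on the step size
        max_relative_change=1.0,  # cap on per-step relative movement
        tolerance=1.0e-7,         # convergence tolerance
    )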

class moe.optimal_learning.python.cpp_wrappers.optimization.NewtonOptimizer(domain, optimizable, optimizer_parameters, num_random_samples=None)[source]

Bases: moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface

Simple container for telling C++ to use Newton for optimization.

See this module’s docstring for some more information or the comments in gpp_optimization.hpp for full details on Newton.

optimize(**kwargs)[source]

C++ does not expose this endpoint.

class moe.optimal_learning.python.cpp_wrappers.optimization.NewtonParameters(*args, **kwargs)[source]

Bases: moe.build.GPP.NewtonParameters, moe.optimal_learning.python.comparison.EqualityComparisonMixin

Container to hold parameters that specify the behavior of Newton in a C++-readable form.

See __init__() docstring for more information.

class moe.optimal_learning.python.cpp_wrappers.optimization.NullOptimizer(domain, optimizable, optimizer_parameters, num_random_samples=None)[source]

Bases: moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface

A “null” or identity optimizer: this does nothing. It is used to perform “dumb” search with MultistartOptimizer.

optimize(*args, **kwargs)[source]

Do nothing; arguments are unused.

class moe.optimal_learning.python.cpp_wrappers.optimization.NullParameters[source]

Bases: moe.optimal_learning.python.cpp_wrappers.optimization.NullParameters

Empty container for optimizers that do not require any parameters (e.g., the null optimizer).

Module contents

Implementations of the ABCs in the moe.optimal_learning.python.interfaces package using calls to the C++ library (GPP.so).

The modules in this package provide hooks into the C++ implementation of optimal_learning’s features. There are functions and classes for model selection, gaussian process construction, expected improvement optimization, etc.

See the package comments for moe.optimal_learning.python.interfaces for an overview of optimal_learning’s capabilities.