moe.optimal_learning.python.python_version package

Submodules

moe.optimal_learning.python.python_version.covariance module

Implementations of covariance functions for use with moe.optimal_learning.python.python_version.log_likelihood and moe.optimal_learning.python.python_version.gaussian_process.

This file contains implementations of CovarianceInterface. Currently, we have SquareExponential, supporting:

  • covariance
  • grad_covariance
  • hyperparameter_grad_covariance

It also contains a few utilities for computing common mathematical quantities and initialization. Note that the hessian is not yet implemented (use C++ for that feature).

Gradient (spatial and hyperparameter) functions return all derivatives at once because there is substantial shared computation. The shared results are by far the most expensive part of gradient computations; they typically involve exponentiation and are also at least partially shared with the base covariance computation. In fact, we could improve performance further by caching [certain] components of the covariance computation for use with the derivative computations.

class moe.optimal_learning.python.python_version.covariance.SquareExponential(hyperparameters)[source]

Bases: moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface

Implement the square exponential covariance function.

The function: cov(x_1, x_2) = \alpha * \exp(-1/2 * ((x_1 - x_2)^T * L * (x_1 - x_2)) ) where L is the diagonal matrix with i-th diagonal entry 1/lengths[i]^2

This covariance object has dim+1 hyperparameters: \alpha, lengths_i

covariance(point_one, point_two)[source]

Compute the square exponential covariance function of two points, cov(point_one, point_two).

Square Exponential: cov(x_1, x_2) = \alpha * \exp(-1/2 * ((x_1 - x_2)^T * L * (x_1 - x_2)) )

Note

comments are copied from the matching method comments of moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface.

The covariance function is guaranteed to be symmetric by definition: covariance(x, y) = covariance(y, x). This function is also positive definite by definition.

Parameters:
  • point_one (array of float64 with shape (dim)) – first input, the point x
  • point_two (array of float64 with shape (dim)) – second input, the point y
Returns:

value of covariance between the input points

Return type:

float64
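
To make the formula concrete, here is a minimal numpy sketch (assuming the hyperparameters are laid out as [\alpha, lengths_0, ..., lengths_{dim-1}], matching the dim+1 hyperparameters described above; this is an illustration, not the library’s implementation):

    import numpy


    def square_exponential_covariance(point_one, point_two, hyperparameters):
        """Evaluate cov(x_1, x_2) = alpha * exp(-0.5 * sum(((x_1 - x_2) / lengths)**2))."""
        alpha = hyperparameters[0]                    # signal variance, \alpha
        lengths = numpy.asarray(hyperparameters[1:])  # per-dimension length scales
        scaled_diff = (numpy.asarray(point_one) - numpy.asarray(point_two)) / lengths
        return alpha * numpy.exp(-0.5 * numpy.dot(scaled_diff, scaled_diff))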

covariance_type = 'square_exponential'
get_hyperparameters()[source]

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

get_json_serializable_info()[source]

Create and return a covariance_info dictionary of this covariance object.

grad_covariance(point_one, point_two)[source]

Compute the gradient of self.covariance(point_one, point_two) with respect to the FIRST argument, point_one.

Gradient of Square Exponential (wrt x_1): \pderiv{cov(x_1, x_2)}{x_{1,i}} = (x_{2,i} - x_{1,i}) / L_{i}^2 * cov(x_1, x_2)

Note

comments are copied from the matching method comments of moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface.

This distinction is important for maintaining the desired symmetry. Cov(x, y) = Cov(y, x). Additionally, \pderiv{Cov(x, y)}{x} = \pderiv{Cov(y, x)}{x}. However, in general, \pderiv{Cov(x, y)}{x} != \pderiv{Cov(y, x)}{y} (NOT equal! These may differ by a negative sign)

Hence to avoid separate implementations for differentiating against first vs second argument, this function only handles differentiation against the first argument. If you need \pderiv{Cov(y, x)}{x}, just swap points x and y.

Parameters:
  • point_one (array of float64 with shape (dim)) – first input, the point x
  • point_two (array of float64 with shape (dim)) – second input, the point y
Returns:

grad_cov: i-th entry is \pderiv{cov(x_1, x_2)}{x_{1,i}}

Return type:

array of float64 with shape (dim)
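
A matching sketch of this spatial gradient, under the same assumed hyperparameter layout (illustrative only):

    import numpy


    def square_exponential_grad_covariance(point_one, point_two, hyperparameters):
        """Gradient of cov(x_1, x_2) wrt x_1: (x_{2,i} - x_{1,i}) / lengths_i**2 * cov(x_1, x_2)."""
        alpha = hyperparameters[0]
        lengths = numpy.asarray(hyperparameters[1:])
        diff = numpy.asarray(point_one) - numpy.asarray(point_two)
        cov = alpha * numpy.exp(-0.5 * numpy.dot(diff / lengths, diff / lengths))
        return -diff / lengths ** 2 * cov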

hyperparameter_grad_covariance(point_one, point_two)[source]

Compute the gradient of self.covariance(point_one, point_two) with respect to its hyperparameters.

Gradient of Square Exponential (wrt hyperparameters (\alpha, L)):
\pderiv{cov(x_1, x_2)}{\theta_0} = cov(x_1, x_2) / \theta_0
\pderiv{cov(x_1, x_2)}{\theta_{i+1}} = [(x_{1,i} - x_{2,i}) / L_i]^2 / L_i * cov(x_1, x_2)
Note: \theta_0 = \alpha and \theta_{1:d} = L_{0:d-1}

Note

comments are copied from the matching method comments of moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface.

Unlike GradCovariance(), the order of point_one and point_two is irrelevant here (since we are not differentiating against either of them). Thus the matrix of grad covariances (wrt hyperparameters) is symmetric.

Parameters:
  • point_one (array of float64 with shape (dim)) – first input, the point x
  • point_two (array of float64 with shape (dim)) – second input, the point y
Returns:

grad_hyperparameter_cov: i-th entry is \pderiv{cov(x_1, x_2)}{\theta_i}

Return type:

array of float64 with shape (num_hyperparameters)
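
A sketch of the hyperparameter gradient under the same assumptions (illustrative only; grad[0] is the derivative wrt \alpha and grad[1:] the derivatives wrt the length scales):

    import numpy


    def square_exponential_hyperparameter_grad_covariance(point_one, point_two, hyperparameters):
        """Gradient of cov(x_1, x_2) wrt [alpha, lengths_0, ..., lengths_{dim-1}]."""
        alpha = hyperparameters[0]
        lengths = numpy.asarray(hyperparameters[1:])
        diff = numpy.asarray(point_one) - numpy.asarray(point_two)
        cov = alpha * numpy.exp(-0.5 * numpy.dot(diff / lengths, diff / lengths))
        grad = numpy.empty(1 + lengths.size)
        grad[0] = cov / alpha                              # d cov / d alpha
        grad[1:] = (diff / lengths) ** 2 / lengths * cov   # d cov / d lengths_i
        return grad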

hyperparameter_hessian_covariance(point_one, point_two)[source]

Compute the hessian of self.covariance(point_one, point_two) with respect to its hyperparameters.

TODO(GH-57): Implement Hessians in Python.

hyperparameters

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

num_hyperparameters[source]

Return the number of hyperparameters of this covariance function.

set_hyperparameters(hyperparameters)[source]

Set hyperparameters to the specified hyperparameters; ordering must match.

moe.optimal_learning.python.python_version.domain module

Various python implementations of interfaces.domain_interface.DomainInterface (e.g., TensorProduct).

These are currently used to describe domain limits for optimizers (i.e., implementations of moe.optimal_learning.python.interfaces.optimization_interface).

Each domain provides functions to:

  • Describe the set of boundary planes
  • Check whether a point is inside/outside
  • Generate random point(s) inside
  • Generate points on a fixed grid
  • Limit updates (from optimizers) so that a path stays inside the domain
class moe.optimal_learning.python.python_version.domain.TensorProductDomain(domain_bounds)[source]

Bases: moe.optimal_learning.python.interfaces.domain_interface.DomainInterface

Domain type for a tensor product domain.

A d-dimensional tensor product domain is D = [x_0_{min}, x_0_{max}] X [x_1_{min}, x_1_{max}] X ... X [x_{d-1}_{min}, x_{d-1}_{max}]

check_point_inside(point)[source]

Check if a point is inside the domain/on its boundary or outside.

Parameters:point (array of float64 with shape (dim)) – point to check
Returns:true if point is inside the domain
Return type:bool
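
As a sketch, the membership test for a tensor product domain reduces to a per-dimension interval check (here domain_bounds is assumed to be a sequence of (min, max) pairs rather than the library’s ClosedInterval objects):

    def check_point_inside(point, domain_bounds):
        """True iff point[i] lies in [min_i, max_i] for every dimension i."""
        return all(lower <= coord <= upper for coord, (lower, upper) in zip(point, domain_bounds))
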
compute_update_restricted_to_domain(max_relative_change, current_point, update_vector)[source]

Compute a new update so that CheckPointInside(current_point + new_update) is true.

Changes new_update_vector so that:
point_new = point + new_update_vector

has coordinates such that CheckPointInside(point_new) returns true. We select point_new by projecting point + update_vector to the nearest point on the domain.

new_update_vector is a function of update_vector. new_update_vector is just a copy of update_vector if current_point is already inside the domain.

Note

We modify update_vector (instead of returning point_new) so that further update limiting/testing may be performed.

Parameters:
  • max_relative_change (float64 in (0, 1]) – max change allowed per update (as a relative fraction of current distance to boundary)
  • current_point (array of float64 with shape (dim)) – starting point
  • update_vector (array of float64 with shape (dim)) – proposed update
Returns:

new update so that the final point remains inside the domain

Return type:

array of float64 with shape (dim)
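
One plausible per-coordinate clamping scheme is sketched below (assuming a box domain given as (min, max) pairs). The library instead selects point_new by projecting point + update_vector onto the domain, so treat this only as an illustration of limiting an update to a relative fraction of the distance to each wall:

    import numpy


    def restrict_update_to_domain(max_relative_change, current_point, update_vector, domain_bounds):
        """Shrink each coordinate of update_vector so current_point + update stays inside the box."""
        new_update = numpy.array(update_vector, dtype=float)
        for i, (lower, upper) in enumerate(domain_bounds):
            # largest step allowed toward each wall: a fraction of the distance to that wall
            max_up = max_relative_change * (upper - current_point[i])
            max_down = max_relative_change * (lower - current_point[i])  # non-positive
            new_update[i] = min(max(new_update[i], max_down), max_up)
        return new_update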

dim[source]

Return the number of spatial dimensions.

domain_type = 'tensor_product'
generate_grid_points_in_domain(points_per_dimension, random_source=None)[source]

Generate a grid of N_0 by N_1 by ... by N_{dim-1} points, with each dimension uniformly spaced along the domain boundary.

See python.geometry_utils.generate_grid_points for more details.

Parameters:
  • points_per_dimension (tuple or scalar) – (n_1, n_2, ... n_{dim}) number of stencil points per spatial dimension. If len(points_per_dimension) == 1, then n_i = points_per_dimension[0] for every dimension
  • random_source (callable yielding uniform random numbers in [0,1]) – random source producing uniform random numbers (e.g., numpy.random.uniform) (UNUSED)
Returns:

grid of points spanning the domain, uniformly spaced in each dimension

generate_random_point_in_domain(random_source=None)[source]

Generate point uniformly at random such that self.check_point_inside(point) is True.

Note

if you need multiple points, use generate_uniform_random_points_in_domain instead; it yields better distributions over many points (via latin hypercube sampling) b/c it guarantees that no non-uniform clusters may arise (in subspaces) versus this method which treats all draws independently.

Returns:point in domain
Return type:array of float64 with shape (dim)
generate_uniform_random_points_in_domain(num_points, random_source=None)[source]

Generate num_points on a latin-hypercube (i.e., like a checkerboard).

See python.geometry_utils.generate_latin_hypercube_points for more details.

Parameters:
  • num_points (int >= 0) – max number of points to generate
  • random_source (callable yielding uniform random numbers in [0,1]) – random source producing uniform random numbers (e.g., numpy.random.uniform) (UNUSED)
Returns:

uniform random sampling of points from the domain

Return type:

array of float64 with shape (num_points, dim)

get_bounding_box()[source]

Return a list of ClosedIntervals representing a bounding box for this domain.

get_constraint_list(start_index=0)[source]

Return a list of lambda functions expressing the domain bounds as linear constraints. Used by COBYLA.

Since COBYLA in scipy only optimizes arrays, we flatten out our points while doing multipoint EI optimization. But in order for the constraints to access the correct index, the RepeatedDomain class has to signal which index the TensorProductDomain should start from, using the start_index optional parameter.

That is, RepeatedDomain deals with N d-dimensional points at once. Thus we need N*d constraints (one per dimension, once per repeat). Additionally, instead of receiving points with shape (num_repeats, dim), COBYLA requires that the points are flattened: (num_repeats*dim, ). Thus this method must know where in the flattened-list it is writing to and reading from: signaled via start_index.

Parameters:start_index (int >= 0) – the dimension this tensor product domain should start indexing from
Returns:a list of lambda functions corresponding to constraints
Return type:array of lambda functions with shape (dim * 2)
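
A minimal sketch of how such constraint callables could be built for a flattened point: scipy’s COBYLA treats a constraint as satisfied when the callable returns a non-negative value, and the default arguments pin each index/bound to avoid the usual late-binding pitfall. The helper name and bounds format are illustrative, not the library’s API:

    def build_box_constraints(domain_bounds, start_index=0):
        """Return 2 * dim callables; each is >= 0 iff the flattened point x respects one bound."""
        constraints = []
        for i, (lower, upper) in enumerate(domain_bounds):
            idx = start_index + i
            constraints.append(lambda x, idx=idx, lower=lower: x[idx] - lower)  # x[idx] >= lower
            constraints.append(lambda x, idx=idx, upper=upper: upper - x[idx])  # x[idx] <= upper
        return constraints
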
get_json_serializable_info(minimal=False)[source]

Create and return a domain_info dictionary of this domain object.

Parameters:minimal (bool) – True to report only domain_type and dim; False for all domain contents
Returns:dict representation of this domain
Return type:dict

moe.optimal_learning.python.python_version.expected_improvement module

Classes (Python) to compute the Expected Improvement, including monte carlo and analytic (where applicable) implementations.

See moe.optimal_learning.python.interfaces.expected_improvement_interface or gpp_math.hpp/cpp for further details on expected improvement.

class moe.optimal_learning.python.python_version.expected_improvement.ExpectedImprovement(gaussian_process, points_to_sample=None, points_being_sampled=None, num_mc_iterations=10000, randomness=None, mvndst_parameters=None)[source]

Bases: moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface, moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface

Implementation of Expected Improvement computation in Python: EI and its gradient at specified point(s) sampled from a GaussianProcess.

A class to encapsulate the computation of expected improvement and its spatial gradient using points sampled from an associated GaussianProcess. The general EI computation requires monte-carlo integration; it can support q,p-EI optimization. It is designed to work with any GaussianProcess.

When available, fast, analytic formulas replace monte-carlo loops.

Note

Equivalent methods of ExpectedImprovementInterface and OptimizableInterface are aliased below (e.g., compute_expected_improvement and compute_objective_function, etc).

See moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface for further details.

compute_expected_improvement(force_monte_carlo=False, force_1d_ei=False)[source]

Compute the expected improvement at points_to_sample, with points_being_sampled concurrent points being sampled.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+], where Xs are potential points to sample (union of points_to_sample and points_being_sampled) and X are already sampled points. The ^+ indicates that the expression in the expectation evaluates to 0 if it is negative. f^*(X) is the MINIMUM over all known function evaluations (points_sampled_value), whereas f(Xs) are GP-predicted function evaluations.

In words, we are computing the expected improvement (over the current best_so_far, best known objective function value) that would result from sampling (aka running new experiments) at points_to_sample with points_being_sampled concurrent/ongoing experiments.

In general, the EI expression is complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The idea of the MC approach is to repeatedly sample at the union of points_to_sample and points_being_sampled. This is analogous to gaussian_process_interface.sample_point_from_gp, but we sample num_union points at once: y = \mu + Lw where \mu is the GP-mean, L is the chol_factor(GP-variance) and w is a vector of num_union draws from N(0, 1). Then: improvement_per_step = max(max(best_so_far - y), 0.0) Observe that the inner max means only the smallest component of y contributes in each iteration. We compute the improvement over many random draws and average.

Parameters:
  • force_monte_carlo (bool) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
  • force_1d_ei (bool) – whether to force using the 1EI method. Used for testing purposes only. Takes precedence when force_monte_carlo is also True
Returns:

the expected improvement from sampling points_to_sample with points_being_sampled concurrent experiments

Return type:

float64
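
A hedged sketch of the monte-carlo loop described above, assuming precomputed GP outputs mu (shape (num_union,)) and chol_var (lower-triangular cholesky factor of the GP variance at the union of points); it mirrors the y = \mu + Lw recipe but is not the library implementation:

    import numpy


    def monte_carlo_expected_improvement(mu, chol_var, best_so_far, num_mc_iterations=10000):
        """Average improvement max(best_so_far - min(y), 0) over draws y = mu + L * w, w ~ N(0, I)."""
        num_union = mu.shape[0]
        total_improvement = 0.0
        for _ in range(num_mc_iterations):
            normals = numpy.random.normal(size=num_union)
            y = mu + numpy.dot(chol_var, normals)
            total_improvement += max(numpy.amax(best_so_far - y), 0.0)
        return total_improvement / num_mc_iterations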

compute_grad_expected_improvement(force_monte_carlo=False)[source]

Compute the gradient of expected improvement at points_to_sample wrt points_to_sample, with points_being_sampled concurrent samples.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

In general, the expressions for gradients of EI are complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The MC computation of grad EI is similar to the computation of EI (described in compute_expected_improvement). We differentiate y = \mu + Lw wrt points_to_sample; only terms from the gradient of \mu and L contribute. In EI, we computed: improvement_per_step = max(max(best_so_far - y), 0.0) and noted that only the smallest component of y may contribute (if it is > 0.0). Call this index winner. Thus in computing grad EI, we only add gradient terms that are attributable to the winner-th component of y.

Parameters:force_monte_carlo (boolean) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
Returns:gradient of EI, \pderiv{EI(Xq \cup Xp)}{Xq_{i,d}} where Xq is points_to_sample and Xp is points_being_sampled (grad EI from sampling points_to_sample with points_being_sampled concurrent experiments wrt each dimension of the points in points_to_sample)
Return type:array of float64 with shape (num_to_sample, dim)
compute_grad_objective_function(force_monte_carlo=False)

Compute the gradient of expected improvement at points_to_sample wrt points_to_sample, with points_being_sampled concurrent samples.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

In general, the expressions for gradients of EI are complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The MC computation of grad EI is similar to the computation of EI (described in compute_expected_improvement). We differentiate y = \mu + Lw wrt points_to_sample; only terms from the gradient of \mu and L contribute. In EI, we computed: improvement_per_step = max(max(best_so_far - y), 0.0) and noted that only the smallest component of y may contribute (if it is > 0.0). Call this index winner. Thus in computing grad EI, we only add gradient terms that are attributable to the winner-th component of y.

Parameters:force_monte_carlo (boolean) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
Returns:gradient of EI, \pderiv{EI(Xq \cup Xp)}{Xq_{i,d}} where Xq is points_to_sample and Xp is points_being_sampled (grad EI from sampling points_to_sample with points_being_sampled concurrent experiments wrt each dimension of the points in points_to_sample)
Return type:array of float64 with shape (num_to_sample, dim)
compute_hessian_objective_function(**kwargs)[source]

We do not currently support computation of the (spatial) hessian of Expected Improvement.

compute_objective_function(force_monte_carlo=False, force_1d_ei=False)

Compute the expected improvement at points_to_sample, with points_being_sampled concurrent points being sampled.

points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.

Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+], where Xs are potential points to sample (union of points_to_sample and points_being_sampled) and X are already sampled points. The ^+ indicates that the expression in the expectation evaluates to 0 if it is negative. f^*(X) is the MINIMUM over all known function evaluations (points_sampled_value), whereas f(Xs) are GP-predicted function evaluations.

In words, we are computing the expected improvement (over the current best_so_far, best known objective function value) that would result from sampling (aka running new experiments) at points_to_sample with points_being_sampled concurrent/ongoing experiments.

In general, the EI expression is complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.

The idea of the MC approach is to repeatedly sample at the union of points_to_sample and points_being_sampled. This is analogous to gaussian_process_interface.sample_point_from_gp, but we sample num_union points at once: y = \mu + Lw where \mu is the GP-mean, L is the chol_factor(GP-variance) and w is a vector of num_union draws from N(0, 1). Then: improvement_per_step = max(max(best_so_far - y), 0.0) Observe that the inner max means only the smallest component of y contributes in each iteration. We compute the improvement over many random draws and average.

Parameters:
  • force_monte_carlo (bool) – whether to force monte carlo evaluation (vs using fast/accurate analytic eval when possible)
  • force_1d_ei (bool) – whether to force using the 1EI method. Used for testing purposes only. Takes precedence when force_monte_carlo is also True
Returns:

the expected improvement from sampling points_to_sample with points_being_sampled concurrent experiments

Return type:

float64

current_point

Get the current_point (array of float64 with shape (problem_size)) at which this object is evaluating the objective function, f(x).

dim[source]

Return the number of spatial dimensions.

evaluate_at_point_list(points_to_evaluate, randomness=None, max_num_threads=4, status=None)[source]

Evaluate Expected Improvement (q,p-EI) over a specified list of points_to_evaluate.

Note

We use points_to_evaluate instead of self._points_to_sample and compute the EI at those points only. self._points_to_sample will be changed.

Generally gradient descent is preferred but when it fails to converge this may be the only “robust” option. This function is also useful for plotting or debugging purposes (just to get a bunch of EI values).

TODO(GH-56): Allow callers to pass in a source of randomness.

Parameters:
  • ei_evaluator (interfaces.expected_improvement_interface.ExpectedImprovementInterface subclass) – object specifying how to evaluate the expected improvement
  • points_to_evaluate (array of float64 with shape (num_to_evaluate, num_to_sample, ei_evaluator.dim)) – points at which to compute EI
  • randomness ((UNUSED)) – random source(s) used for monte-carlo integration (when applicable) (UNUSED)
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1 (UNUSED)
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

EI evaluated at each of points_to_evaluate

Return type:

array of float64 with shape (points_to_evaluate.shape[0])

get_current_point()[source]

Get the current_point (array of float64 with shape (problem_size)) at which this object is evaluating the objective function, f(x).

num_being_sampled[source]

Number of points being sampled in concurrent experiments; i.e., the p in q,p-EI.

num_to_sample[source]

Number of points at which to compute/optimize EI, aka potential points to sample in future experiments; i.e., the q in q,p-EI.

problem_size[source]

Return the number of independent parameters to optimize.

set_current_point(points_to_sample)[source]

Set current_point to the specified point; ordering must match.

Parameters:points_to_sample (array of float64 with shape (problem_size)) – current_point at which to evaluate the objective function, f(x)
moe.optimal_learning.python.python_version.expected_improvement.MINIMUM_VARIANCE_EI = 2.2250738585072014e-308

Minimum allowed variance value in the “1D” analytic EI computation. Values that are too small result in problems b/c we may compute std_dev/var (which is enormous if std_dev = 1.0e-150 and var = 1.0e-300) since this only arises when we fail to compute std_dev = var = 0.0. Note: this is only relevant if noise = 0.0; this minimum will not affect EI computation with noise since this value is below the smallest amount of noise users can meaningfully add. This is the smallest possible value that prevents the denominator of (best_so_far - mean) / sqrt(variance) from being 0. 1D analytic EI is simple and no other robustness considerations are needed.

moe.optimal_learning.python.python_version.expected_improvement.MINIMUM_VARIANCE_GRAD_EI = 7.3955709864469857e-30

Minimum allowed variance value in the “1D” analytic grad EI computation. See moe.optimal_learning.python.python_version.expected_improvement.MINIMUM_VARIANCE_EI for more details. This value was chosen so its sqrt would be a little larger than GaussianProcess::kMinimumStdDev (by ~12x). The factor of 150.0 (this constant equals 150.0 * MINIMUM_STD_DEV_GRAD_CHOLESKY^2) was determined by numerical experiment with the setup in test_1d_analytic_ei_edge_cases() in order to find a setting that would be robust (no 0/0) while introducing minimal error.

class moe.optimal_learning.python.python_version.expected_improvement.MVNDSTParameters[source]

Bases: moe.optimal_learning.python.python_version.expected_improvement._BaseMVNDSTParameters

Container to hold parameters that specify the behavior of mvndst, which qEI uses to calculate EI.

For more information about these parameters, consult: http://www.math.wsu.edu/faculty/genz/software/fort77/mvndstpack.f

Note

The actual accuracy used in mvndst is MAX(abseps, FINEST * releps), where FINEST is the estimate of the cdf integral. Because of this, it is almost always the case that abseps should be set to 0 for releps to be used.

Variables:
  • releps – (float > 0.0) relative accuracy at which to calculate the cdf of the multivariate gaussian (suggest: 1.0e-9)
  • abseps – (float > 0.0) absolute accuracy at which to calculate the cdf of the multivariate gaussian (suggest: 1.0e-9)
  • maxpts_per_dim – (int > 0) the maximum number of iterations mvndst will do is num_dimensions * maxpts_per_dim (suggest: 20000)
moe.optimal_learning.python.python_version.expected_improvement.multistart_expected_improvement_optimization(ei_optimizer, num_multistarts, num_to_sample, randomness=None, max_num_threads=4, status=None)[source]

Solve the q,p-EI problem, returning the optimal set of q points to sample CONCURRENTLY in future experiments.

When points_being_sampled.shape[0] == 0 && num_to_sample == 1, this function will use (fast) analytic EI computations.

This is the primary entry-point for EI optimization in the optimal_learning library. It offers our best shot at improving robustness by combining higher accuracy methods like gradient descent with fail-safes like random/grid search.

Returns the optimal set of q points to sample CONCURRENTLY by solving the q,p-EI problem. That is, we may want to run 4 experiments at the same time and maximize the EI across all 4 experiments at once while knowing of 2 ongoing experiments (4,2-EI). This function handles this use case. Evaluation of q,p-EI (and its gradient) for q > 1 or p > 1 is expensive (requires monte-carlo iteration), so this method is usually very expensive.

Compared to ComputeHeuristicPointsToSample() (gpp_heuristic_expected_improvement_optimization.hpp), this function makes no external assumptions about the underlying objective function. Instead, it utilizes the Expected (Parallel) Improvement, allowing the GP to account for ongoing/incomplete experiments.

If num_to_sample = 1, this is the same as ComputeOptimalPointsToSampleWithRandomStarts().

TODO(GH-56): Allow callers to pass in a source of randomness.

Parameters:
  • ei_optimizer (interfaces.optimization_interfaces.OptimizerInterface subclass) – object that optimizes (e.g., gradient descent, newton) EI over a domain
  • num_multistarts (int > 0) – number of times to multistart ei_optimizer
  • num_to_sample (int >= 1) – how many simultaneous experiments you would like to run (i.e., the q in q,p-EI) (UNUSED, specify through ei_optimizer)
  • randomness ((UNUSED)) – random source(s) used to generate multistart points and perform monte-carlo integration (when applicable) (UNUSED)
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1 (UNUSED)
  • status (dict) – (output) status messages from C++ (e.g., reporting on optimizer success, etc.)
Returns:

point(s) that maximize the expected improvement (solving the q,p-EI problem)

Return type:

array of float64 with shape (num_to_sample, ei_evaluator.dim)

moe.optimal_learning.python.python_version.gaussian_process module

Implementation (Python) of GaussianProcessInterface.

This file contains a class to manipulate a Gaussian Process through numpy/scipy.

See moe.optimal_learning.python.interfaces.gaussian_process_interface for more details.

class moe.optimal_learning.python.python_version.gaussian_process.GaussianProcess(covariance_function, historical_data)[source]

Bases: moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessInterface

Implementation of a GaussianProcess strictly in Python.

Note

Comments in this class are copied from this object’s superclass in moe.optimal_learning.python.interfaces.gaussian_process_interface.

Object that encapsulates Gaussian Process Priors (GPPs). A GPP is defined by a set of (sample point, function value, noise variance) triples along with a covariance function that relates the points. Each point has dimension dim. These are the training data; for example, each sample point might specify an experimental cohort and the corresponding function value is the objective measured for that experiment. There is one noise variance value per function value; this is the measurement error and is treated as N(0, noise_variance) Gaussian noise.

GPPs estimate a real process f(x) = GP(m(x), k(x,x’)) (see file docs). This class deals with building an estimator to the actual process using measurements taken from the actual process–the (sample point, function val, noise) triple. Then predictions about unknown points can be made by sampling from the GPP–in particular, finding the (predicted) mean and variance. These functions (and their gradients) are provided in ComputeMeanOfPoints, ComputeVarianceOfPoints, etc.

Further mathematical details are given in the implementation comments, but we are essentially computing:

ComputeMeanOfPoints : K(Xs, X) * [K(X,X) + \sigma_n^2 I]^{-1} * y = Ks^T * K^{-1} * y
ComputeVarianceOfPoints: K(Xs, Xs) - K(Xs,X) * [K(X,X) + \sigma_n^2 I]^{-1} * K(X,Xs) = Kss - Ks^T * K^{-1} * Ks

This (estimated) mean and variance characterize the predicted distributions of the actual m(x), k(x,x’) functions that underlie our GP.

The “independent variables” for this object are points_to_sample. These points are both the “p” and the “q” in q,p-EI; i.e., they are the parameters of both ongoing experiments and new predictions. Recall that in q,p-EI, the q points are called points_to_sample and the p points are called points_being_sampled. Here, we need to make predictions about both point sets with the GP, so we simply call the union of point sets points_to_sample.

In GP computations, there is really no distinction between the “q” and “p” points from EI, points_to_sample and points_being_sampled, respectively. However, in EI optimization, we only need gradients of GP quantities wrt points_to_sample, so users should call member functions with num_derivatives = num_to_sample in that context.
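
For reference, a compact numpy/scipy sketch of the two formulas above, where K is the covariance among sampled points (noise already added to its diagonal), Ks the cross-covariance against points_to_sample, Kss the covariance among points_to_sample, and y the observed values; the variable names are assumptions for illustration:

    import numpy
    import scipy.linalg


    def gp_mean_and_variance(K, Ks, Kss, y):
        """mu_star = Ks^T K^{-1} y;  Var_star = Kss - Ks^T K^{-1} Ks (K already includes noise)."""
        K_chol = scipy.linalg.cho_factor(K, lower=True)
        mean = numpy.dot(Ks.T, scipy.linalg.cho_solve(K_chol, y))
        variance = Kss - numpy.dot(Ks.T, scipy.linalg.cho_solve(K_chol, Ks))
        return mean, variance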

add_sampled_points(sampled_points, validate=False)[source]

Add sampled point(s) (point, value, noise) to the GP’s prior data.

Also forces recomputation of all derived quantities for GP to remain consistent.

Parameters:
  • sampled_points (list of moe.optimal_learning.python.SamplePoint objects) – SamplePoint objects to load into the GP (containing point, function value, and noise variance)
  • validate (boolean) – whether to sanity-check the input sample_points
compute_cholesky_variance_of_points(points_to_sample)[source]

Compute the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample).

Warning

points_to_sample should not contain duplicate points.

Parameters:points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
Returns:cholesky factorization of the variance matrix of this GP, lower triangular
Return type:array of float64 with shape (num_to_sample, num_to_sample), lower triangle filled in
compute_grad_cholesky_variance_of_points(points_to_sample, chol_var=None, num_derivatives=-1)[source]

Compute the gradient of the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample) wrt Xs.

Warning

points_to_sample should not contain duplicate points.

This function accounts for the effect on the gradient resulting from cholesky-factoring the variance matrix. See Smith 1995 for algorithm details.

Observe that grad_chol is nominally sized: grad_chol[num_to_sample][num_to_sample][num_to_sample][dim]. Let this be indexed grad_chol[k][j][i][d], which is read as: the derivative of var[j][i] with respect to x_{k,d} (x = points_to_sample)

Parameters:
  • points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
  • chol_var (array of float64 with shape (num_to_sample, num_to_sample)) – the cholesky factorization (L) of the variance matrix; only the lower triangle is accessed
  • num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns:

grad_chol: gradient of the cholesky factorization of the variance matrix of this GP. grad_chol[k][j][i][d] is actually the gradients of var_{j,i} with respect to x_{k,d}, the d-th dimension of the k-th entry of points_to_sample, where k = var_of_grad

Return type:

array of float64 with shape (num_derivatives, num_to_sample, num_to_sample, dim)

compute_grad_mean_of_points(points_to_sample, num_derivatives=-1)[source]

Compute the gradient of the mean of this GP at each point of Xs (points_to_sample) wrt Xs.

Warning

points_to_sample should not contain duplicate points.

Observe that grad_mu is nominally sized: grad_mu[num_to_sample][num_to_sample][dim]. This is the d-th component of the derivative evaluated at the i-th input wrt the j-th input. However, for 0 <= i,j < num_to_sample, i != j, grad_mu[j][i][d] = 0. (See references or implementation for further details.) Thus, grad_mu is stored in a reduced form which only tracks the nonzero entries.

Parameters:
  • points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
  • num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns:

grad_mu: gradient of the mean of the GP. grad_mu[i][d] is actually the gradient of \mu_i wrt x_{i,d}, the d-th dim of the i-th entry of points_to_sample.

Return type:

array of float64 with shape (num_derivatives, dim)

compute_grad_variance_of_points(points_to_sample, num_derivatives=-1)[source]

Compute the gradient of the variance (matrix) of this GP at each point of Xs (points_to_sample) wrt Xs.

Warning

points_to_sample should not contain duplicate points.

This function is similar to compute_grad_cholesky_variance_of_points() (below), except this does not include gradient terms from the cholesky factorization. Description will not be duplicated here.

Parameters:
  • points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
  • num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns:

grad_var: gradient of the variance matrix of this GP

Return type:

array of float64 with shape (num_derivatives, num_to_sample, num_to_sample, dim)

compute_mean_of_points(points_to_sample)[source]

Compute the mean of this GP at each point of Xs (points_to_sample).

Warning

points_to_sample should not contain duplicate points.

Parameters:points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
Returns:mean: where mean[i] is the mean at points_to_sample[i]
Return type:array of float64 with shape (num_to_sample)
compute_variance_of_points(points_to_sample)[source]

Compute the variance (matrix) of this GP at each point of Xs (points_to_sample).

Warning

points_to_sample should not contain duplicate points.

The variance matrix is symmetric although we currently return the full representation.

Parameters:points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
Returns:var_star: variance matrix of this GP
Return type:array of float64 with shape (num_to_sample, num_to_sample)
dim[source]

Return the number of spatial dimensions.

get_covariance_copy()[source]

Return a copy of the covariance object specifying the Gaussian Process.

Returns:covariance object encoding assumptions about the GP’s behavior on our data
Return type:interfaces.covariance_interface.CovarianceInterface subclass
get_historical_data_copy()[source]

Return the data (points, function values, noise) specifying the prior of the Gaussian Process.

Returns:object specifying the already-sampled points, the objective value at those points, and the noise variance associated with each observation
Return type:data_containers.HistoricalData
num_sampled[source]

Return the number of sampled points.

sample_point_from_gp(point_to_sample, noise_variance=0.0)[source]

Sample a function value from a Gaussian Process prior, provided a point at which to sample.

Uses the formula function_value = gpp_mean + sqrt(gpp_variance) * w1 + sqrt(noise_variance) * w2, where w1, w2 are draws from N(0,1).

Note

Set noise_variance to 0 if you want “accurate” draws from the GP. BUT if the drawn (point, value) pair is meant to be added back into the GP (e.g., for testing), then this point MUST be drawn with noise_variance equal to the noise associated with “point” as a member of “points_sampled”

Parameters:
  • point_to_sample – point (in dim dimensions) at which to sample from this GP
  • noise_variance (float64 >= 0.0) – amount of noise to associate with the sample
Returns:

sample_value: function value drawn from this GP

Return type:

float64
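
The sampling formula reduces to a couple of lines; here gp_mean and gp_variance stand in for the outputs of compute_mean_of_points / compute_variance_of_points at a single point (illustrative sketch):

    import numpy


    def sample_from_gp(gp_mean, gp_variance, noise_variance=0.0):
        """function_value = gpp_mean + sqrt(gpp_variance) * w1 + sqrt(noise_variance) * w2."""
        w1, w2 = numpy.random.normal(size=2)
        return gp_mean + numpy.sqrt(gp_variance) * w1 + numpy.sqrt(noise_variance) * w2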

moe.optimal_learning.python.python_version.gaussian_process.MINIMUM_STD_DEV_GRAD_CHOLESKY = 2.2204460492503131e-16

Minimum allowed standard deviation value in ComputeGradCholeskyVarianceOfPointsPerPoint (= machine precision). Values that are too small result in problems b/c we may compute std_dev/var (which is enormous if std_dev = 1.0e-150 and var = 1.0e-300) since this only arises when we fail to compute std_dev = var = 0.0. Note: this is only relevant if noise = 0.0; this minimum will not affect GPs with noise since this value is below the smallest amount of noise users can meaningfully add. This value was chosen to be consistent with the singularity condition in scipy.linalg.cho_factor and tested for robustness with the setup in test_1d_analytic_ei_edge_cases().

moe.optimal_learning.python.python_version.log_likelihood module

Tools to compute log likelihood-like measures of model fit and optimize them (wrt the hyperparameters of covariance) to select the best model for a given set of historical data.

See the file comments in moe.optimal_learning.python.interfaces.log_likelihood_interface for an overview of log likelihood-like metrics and their role in model selection. This file provides an implementation of the Log Marginal Likelihood.

Note

This is a copy of the file comments in moe.optimal_learning.python.cpp_wrappers.log_likelihood.

LOG MARGINAL LIKELIHOOD (LML)

(Rasmussen & Williams, 5.4.1) The Log Marginal Likelihood measure comes from the ideas of Bayesian model selection, which use Bayesian inference to predict distributions over models and their parameters. The cpp file comments explore this idea in more depth. For now, we will simply state the relevant result. We can build up the notion of the “marginal likelihood”: probability(observed data GIVEN sampling points (X), model hyperparameters, model class (regression, GP, etc.)), which is denoted: p(y | X, \theta, H_i) (see the cpp file comments for more).

So the marginal likelihood deals with computing the probability that the observed data was generated from (another way: is easily explainable by) the given model.

The marginal likelihood is in part parameterized by the model’s hyperparameters; e.g., as mentioned above. Thus we can search for the set of hyperparameters that produces the best marginal likelihood and use them in our model. Additionally, a nice property of marginal likelihood optimization is that it automatically trades off between model complexity and data fit, producing a model that is reasonably simple while still explaining the data reasonably well. See the cpp file comments for more discussion of how/why this works.

In general, we do not want a model with perfect fit and high complexity, since this implies overfit to input noise. We also do not want a model with very low complexity and poor data fit: here we are washing the signal out with (assumed) noise, so the model is simple but it provides no insight on the data.

This is not magic. Using GPs as an example, if the covariance function is completely mis-specified, we can blindly go through with marginal likelihood optimization, obtain an “optimal” set of hyperparameters, and proceed... never realizing that our fundamental assumptions are wrong. So care is always needed.

class moe.optimal_learning.python.python_version.log_likelihood.GaussianProcessLogMarginalLikelihood(covariance_function, historical_data)[source]

Bases: moe.optimal_learning.python.interfaces.log_likelihood_interface.GaussianProcessLogLikelihoodInterface, moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface

Class for computing the Log Marginal Likelihood, log(p(y | X, \theta)).

That is, the probability of observing the training values, y, given the training points, X, and hyperparameters (of the covariance function), \theta.

This is a measure of how likely it is that the observed values came from our Gaussian Process Prior.

Given a particular covariance function (including hyperparameters) and training data ((point, function value, measurement noise) tuples), the log marginal likelihood is the log probability that a Gaussian Process with that covariance would have generated the observed function values at the given measurement points. So log marginal likelihood tells us “the probability of the observations given the assumptions of the model.” Log marginal sits well with the Bayesian Inference camp. (Rasmussen & Williams p118)

This quantity primarily deals with the trade-off between model fit and model complexity. Handling this trade-off is automatic in the log marginal likelihood calculation. See Rasmussen & Williams 5.2 and 5.4.1 for more details.

We can use the log marginal likelihood to determine how good our model is. Additionally, we can maximize it by varying hyperparameters (or even changing covariance functions) to improve our model quality. Hence this class provides access to functions for computing log marginal likelihood and its hyperparameter gradients.

Note

Equivalent methods of moe.optimal_learning.python.interfaces.log_likelihood_interface.GaussianProcessLogLikelihoodInterface and moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface are aliased below (e.g., problem_size and num_hyperparameters, compute_log_likelihood and compute_objective_function, etc).

compute_grad_log_likelihood()[source]

Compute the gradient (wrt hyperparameters) of the _log_likelihood_type measure at the specified hyperparameters.

Note

These comments are copied from LogMarginalLikelihoodEvaluator::ComputeGradLogLikelihood in gpp_model_selection.cpp.

Computes:
\pderiv{log(p(y | X, \theta))}{\theta_k} = \frac{1}{2} * \alpha_i * \pderiv{K_{ij}}{\theta_k} * \alpha_j - \frac{1}{2} * trace(K^{-1}_{ij} * \pderiv{K_{ij}}{\theta_k})
Or equivalently,
= \frac{1}{2} * trace([\alpha_i * \alpha_j - K^{-1}_{ij}] * \pderiv{K_{ij}}{\theta_k}), where \alpha_i = K^{-1}_{ij} * y_j

Returns:grad_log_likelihood: i-th entry is \pderiv{LL(y | X, \theta)}{\theta_i}
Return type:array of float64 with shape (num_hyperparameters)
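
A hedged sketch of that trace identity, assuming K (covariance matrix with noise), y (observed values), and grad_K (array of shape (num_hyperparameters, num_sampled, num_sampled) holding \pderiv{K}{\theta_k}) are already available; the names are illustrative:

    import numpy
    import scipy.linalg


    def grad_log_marginal_likelihood(K, y, grad_K):
        """grad_k = 0.5 * trace((alpha alpha^T - K^{-1}) * dK/dtheta_k), where alpha = K^{-1} y."""
        K_chol = scipy.linalg.cho_factor(K, lower=True)
        alpha = scipy.linalg.cho_solve(K_chol, y)
        K_inv = scipy.linalg.cho_solve(K_chol, numpy.eye(K.shape[0]))
        inner = numpy.outer(alpha, alpha) - K_inv
        return 0.5 * numpy.array([numpy.trace(numpy.dot(inner, dK)) for dK in grad_K])
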
compute_grad_objective_function()

Compute the gradient (wrt hyperparameters) of the _log_likelihood_type measure at the specified hyperparameters.

Note

These comments are copied from LogMarginalLikelihoodEvaluator::ComputeGradLogLikelihood in gpp_model_selection.cpp.

Computes:
\pderiv{log(p(y | X, \theta))}{\theta_k} = \frac{1}{2} * \alpha_i * \pderiv{K_{ij}}{\theta_k} * \alpha_j - \frac{1}{2} * trace(K^{-1}_{ij} * \pderiv{K_{ij}}{\theta_k})
Or equivalently,
= \frac{1}{2} * trace([\alpha_i * \alpha_j - K^{-1}_{ij}] * \pderiv{K_{ij}}{\theta_k}), where \alpha_i = K^{-1}_{ij} * y_j

Returns:grad_log_likelihood: i-th entry is \pderiv{LL(y | X, \theta)}{\theta_i}
Return type:array of float64 with shape (num_hyperparameters)
compute_hessian_log_likelihood()[source]

We do not currently support computation of the (hyperparameter) hessian of log likelihood-like metrics.

compute_hessian_objective_function()

We do not currently support computation of the (hyperparameter) hessian of log likelihood-like metrics.

compute_log_likelihood()[source]

Compute the _log_likelihood_type measure at the specified hyperparameters.

Note

These comments are copied from LogMarginalLikelihoodEvaluator::ComputeLogLikelihood in gpp_model_selection.cpp.

log p(y | X, \theta) = -\frac{1}{2} * y^T * K^{-1} * y - \frac{1}{2} * \log(det(K)) - \frac{n}{2} * \log(2*pi)
where n is num_sampled, \theta are the hyperparameters, and \log is the natural logarithm. In the following:
term1 = -\frac{1}{2} * y^T * K^{-1} * y
term2 = -\frac{1}{2} * \log(det(K))
term3 = -\frac{n}{2} * \log(2*pi)

For an SPD matrix K = L * L^T, det(K) = \Pi_i L_ii^2 We could compute this directly and then take a logarithm. But we also know: \log(det(K)) = 2 * \sum_i \log(L_ii) The latter method is (currently) preferred for computing \log(det(K)) due to reduced chance for overflow and (possibly) better numerical conditioning.

Returns:value of log_likelihood evaluated at hyperparameters (LL(y | X, \theta))
Return type:float64
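
A sketch of the three terms using a cholesky factor, so that \log(det(K)) = 2 * \sum_i \log(L_ii) as noted below; K is assumed to already include the noise variance on its diagonal and the names are illustrative:

    import numpy
    import scipy.linalg


    def log_marginal_likelihood(K, y):
        """term1 + term2 + term3 = -0.5 y^T K^{-1} y - 0.5 log(det(K)) - (n/2) log(2 pi)."""
        n = y.shape[0]
        L = scipy.linalg.cholesky(K, lower=True)
        K_inv_y = scipy.linalg.cho_solve((L, True), y)
        term1 = -0.5 * numpy.dot(y, K_inv_y)
        term2 = -numpy.sum(numpy.log(numpy.diag(L)))  # -0.5 * log(det(K)) = -sum(log(L_ii))
        term3 = -0.5 * n * numpy.log(2.0 * numpy.pi)
        return term1 + term2 + term3
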
compute_objective_function()

Compute the _log_likelihood_type measure at the specified hyperparameters.

Note

These comments are copied from LogMarginalLikelihoodEvaluator::ComputeLogLikelihood in gpp_model_selection.cpp.

log p(y | X, \theta) = -\frac{1}{2} * y^T * K^{-1} * y - \frac{1}{2} * \log(det(K)) - \frac{n}{2} * \log(2*pi)
where n is num_sampled, \theta are the hyperparameters, and \log is the natural logarithm. In the following:
term1 = -\frac{1}{2} * y^T * K^{-1} * y
term2 = -\frac{1}{2} * \log(det(K))
term3 = -\frac{n}{2} * \log(2*pi)

For an SPD matrix K = L * L^T, det(K) = \Pi_i L_ii^2 We could compute this directly and then take a logarithm. But we also know: \log(det(K)) = 2 * \sum_i \log(L_ii) The latter method is (currently) preferred for computing \log(det(K)) due to reduced chance for overflow and (possibly) better numerical conditioning.

Returns:value of log_likelihood evaluated at hyperparameters (LL(y | X, \theta))
Return type:float64
current_point

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

Equivalently, get the current_point at which this object is evaluating the objective function, f(x)

dim[source]

Return the number of spatial dimensions.

get_covariance_copy()[source]

Return a copy of the covariance object specifying the Gaussian Process.

Returns:covariance object encoding assumptions about the GP’s behavior on our data
Return type:interfaces.covariance_interface.CovarianceInterface subclass
get_historical_data_copy()[source]

Return the data (points, function values, noise) specifying the prior of the Gaussian Process.

Returns:object specifying the already-sampled points, the objective value at those points, and the noise variance associated with each observation
Return type:data_containers.HistoricalData
get_hyperparameters()[source]

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

Equivalently, get the current_point at which this object is evaluating the objective function, f(x)

hyperparameters

Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.

Equivalently, get the current_point at which this object is evaluating the objective function, f(x)

num_hyperparameters[source]

Return the number of hyperparameters aka the number of independent parameters to optimize.

problem_size

Return the number of hyperparameters aka the number of independent parameters to optimize.

set_hyperparameters(hyperparameters)[source]

Set hyperparameters to the specified hyperparameters; ordering must match.

Parameters:hyperparameters (array of float64 with shape (num_hyperparameters)) – hyperparameters at which to evaluate the log likelihood (objective function), f(x)
moe.optimal_learning.python.python_version.log_likelihood.evaluate_log_likelihood_at_hyperparameter_list(log_likelihood_evaluator, hyperparameters_to_evaluate, max_num_threads=4, status=None)[source]

Compute the specified log likelihood measure at each input set of hyperparameters.

Generally Newton or gradient descent is preferred but when they fail to converge this may be the only “robust” option. This function is also useful for plotting or debugging purposes (just to get a bunch of log likelihood values).

Parameters:
  • log_likelihood_evaluator (interfaces.log_likelihood_interface.LogLikelihoodInterface subclass) – object specifying which log likelihood measure to evaluate
  • hyperparameters_to_evaluate (array of float64 with shape (num_to_eval, log_likelihood_evaluator.num_hyperparameters)) – the hyperparameters at which to compute the specified log likelihood
  • max_num_threads (int) – maximum number of threads to use, >= 1 (UNUSED)
  • status (dict) – (output) status messages (e.g., reporting on optimizer success, etc.)
Returns:

log likelihood value at each specified set of hyperparameters

Return type:

array of float64 with shape (hyperparameters_to_evaluate.shape[0])

moe.optimal_learning.python.python_version.log_likelihood.multistart_hyperparameter_optimization(hyperparameter_optimizer, num_multistarts, randomness=None, max_num_threads=4, status=None)[source]

Select the hyperparameters that maximize the specified log likelihood measure of model fit (over the historical data) within the specified domain.

See moe.optimal_learning.python.python_version.log_likelihood.GaussianProcessLogMarginalLikelihood and moe.optimal_learning.python.python_version.log_likelihood.GaussianProcessLeaveOneOutLogLikelihood for an overview of some example log likelihood-like measures.

Optimizers are: null (‘dumb’ search), gradient descent, and Newton. Newton is the suggested optimizer, but it is not presently available in Python (use the C++ interface); in Python, gradient descent is suggested.

TODO(GH-57): Implement hessians and Newton’s method.

‘dumb’ search means this will just evaluate the objective log likelihood measure at num_multistarts ‘points’ (hyperparameters) in the domain, uniformly sampled using latin hypercube sampling.

See gpp_python_common.cpp for C++ enum declarations laying out the options for objective and optimizer types.

Currently, during optimization, we recommend that the coordinates of the initial guesses not differ from the coordinates of the optima by more than about 1 order of magnitude. This is a very (VERY!) rough guideline for sizing the domain and gd_parameters.num_multistarts; i.e., be wary of sets of initial guesses that cover the space too sparsely.

Solution is guaranteed to lie within the region specified by “domain”; note that this may not be a true optimum (i.e., the gradient may be substantially nonzero).

Warning

this function fails if NO improvement can be found! In that case, the output will always be the first randomly chosen point. status will report failure.

TODO(GH-56): Allow callers to pass in a source of randomness.

Parameters:
  • hyperparameter_optimizer (interfaces.optimization_interfaces.OptimizerInterface subclass) – object that optimizes (e.g., gradient descent, newton) the desired log_likelihood measure over a domain (wrt the hyperparameters of covariance)
  • num_multistarts (int > 0) – number of times to multistart hyperparameter_optimizer
  • randomness ((UNUSED)) – random source used to generate multistart points (UNUSED)
  • max_num_threads (int > 0) – maximum number of threads to use, >= 1 (UNUSED)
  • status (dict) – (output) status messages (e.g., reporting on optimizer success, etc.)
Returns:

hyperparameters that maximize the specified log likelihood measure within the specified domain

Return type:

array of float64 with shape (log_likelihood_evaluator.num_hyperparameters)

moe.optimal_learning.python.python_version.optimization module

Classes for various optimizers (maximizers) and multistarting said optimizers.

Note

comments in this module are copied from the header comments in gpp_optimization.hpp.

Table of Contents:

  1. FILE OVERVIEW
  2. OPTIMIZATION OF OBJECTIVE FUNCTIONS
    1. GRADIENT DESCENT
      1. OVERVIEW
      2. IMPLEMENTATION DETAILS
    2. NEWTON’S METHOD
      1. OVERVIEW
      2. IMPLEMENTATION DETAILS
    3. MULTISTART OPTIMIZATION

Read the “OVERVIEW” sections for header-style comments that describe the file contents at a high level. Read the “IMPLEMENTATION” comments for cpp-style comments that talk more about the specifics. Both types are included together here since this file contains template class declarations and template function definitions. For further implementation details, see comment blocks before each individual class/function.

1. FILE OVERVIEW

First, the functions in this file are all MAXIMIZERS. We also use the term “optima,” and unless we specifically state otherwise, “optima” and “optimization” refer to “maxima” and “maximization,” respectively. (Note that minimizing g(x) is equivalent to maximizing f(x) = -1 * g(x).)

This file contains templates for some common optimization techniques: gradient descent (GD) and Newton’s method. We provide constrained implementations (constraint via heuristics like restricting updates to 50% of the distance to the nearest wall) of these optimizers. For unconstrained, just set the domain to be huge: [-DBL_MAX, DBL_MAX].

We provide *Optimizer template classes (e.g., NewtonOptimizer) as main endpoints for doing local optimization (i.e., run the optimization method from a single initial guess). We also provide a MultistartOptimizer class for global optimization (i.e., start optimizers from each of a set of initial guesses). These are all discussed further below. These templated classes are general and can optimize any OptimizableInterface subclass.

In this way, we can make the local and global optimizers completely agnostic to the function they are optimizing.

2. OPTIMIZATION OF OBJECTIVE FUNCTIONS

2a. GRADIENT DESCENT (GD)

Note

Below there is some discussion of “restarted” Gradient Descent; this is not yet implemented in Python. See moe.optimal_learning.python.cpp_wrappers.optimization if you want to use this feature.

2a, i. OVERVIEW

We use first derivative information to walk the path of steepest ascent, hopefully toward a (local) maximum of the chosen log likelihood measure. This is implemented in: GradientDescentOptimization(). This method ensures that the result lies within a specified domain.

We additionally restart gradient-descent in practice; i.e., we repeatedly take the output of a GD run and start a new GD run from that point. This lives in: GradientDescentOptimizer::Optimize().

Even with restarts, gradient descent (GD) cannot start “too far” from the solution and still successfully find it. Thus users should typically start it from multiple initial guesses and take the best one (see gpp_math and gpp_model_selection for examples). The MultistartOptimizer template class in this file provides generic multistart functionality.

Finally, we optionally apply Polyak-Ruppert averaging. This is described in more detail in the docstring for GradientDescentParameters. For functions where we only have access to gradient + noise, this averaging can lead to better answers than straight gradient descent. It amounts to averaging over the final N_{avg} steps.

Gradient descent is implemented in: GradientDescentOptimizer::Optimize() (which calls GradientDescentOptimization())

2a, ii. IMPLEMENTATION DETAILS

GD’s update is: \theta_{i+1} = \theta_i + \gamma * \nabla f(\theta_i) where \gamma controls the step-size and is chosen heuristically, often varying by problem.

The previous update leads to unconstrained optimization. To ensure that our results always stay within the specified domain, we additionally limit updates if they would move us outside the domain. For example, we could imagine only moving half the distance to the nearest boundary.

With gradient descent (GD), it is hard to know what step sizes to take. Unfortunately, far enough away from an optima, the objective could increase (but very slowly). If gradient descent takes too large of a step in a bad direction, it can easily “get lost.” At the same time, taking very small steps leads to slow performance. To help, we take the standard approach of scaling down step size with iteration number. We also allow the user to specify a maximum relative change to limit the aggressiveness of GD steps. Finally, we wrap GD in a restart loop, where we fire off another GD run from the current location unless convergence was reached.
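
A minimal sketch of the scheme just described: maximization by gradient ascent with a step size that decays with iteration number and updates clipped to stay inside a box domain. The parameter names and decay rule are illustrative assumptions, not the library’s GradientDescentParameters:

    import numpy


    def gradient_ascent(grad_f, x0, domain_bounds, gamma=0.1, decay=0.7, max_num_steps=100):
        """theta_{i+1} = theta_i + step_size * grad_f(theta_i), clipped to the box domain."""
        x = numpy.array(x0, dtype=float)
        lower, upper = (numpy.array(b, dtype=float) for b in zip(*domain_bounds))
        for i in range(1, max_num_steps + 1):
            step_size = gamma / i ** decay  # scale step size down with iteration number
            x = numpy.clip(x + step_size * grad_f(x), lower, upper)
        return x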

2b. NEWTON’S METHOD:

Note

Newton’s method is not yet implemented in Python. See moe.optimal_learning.python.cpp_wrappers.optimization if you want to use this feature.

2b, i. OVERVIEW

Newton’s Method (for optimization) uses second derivative information in addition to the first derivatives used by gradient descent (GD). In higher dimensions, first derivatives => gradients and second derivatives => Hessian matrix. At each iteration, gradient descent computes the derivative and blindly takes a step (of some heuristically determined size) in that direction. Care must be taken in the step size choice to balance robustness and speed while ensuring that convergence is possible. By using the second derivative (the Hessian matrix in higher dimensions), which can be interpreted as information about local curvature, Newton makes better* choices about step size and direction to ensure rapid** convergence.

*, ** See “IMPLEMENTATION DETAILS” comments section for details.

Recall that Newton indiscriminately finds solutions where f'(x) = 0; the eigenvalues of the Hessian classify these x as optima, saddle points, or indeterminate. We multistart Newton (e.g., gpp_model_selection) but just take the best objective value without classifying solutions. The MultistartOptimizer template class in this file provides generic multistart functionality.

Newton is implemented here: NewtonOptimizer::Optimize() (which calls NewtonOptimization())

2b, ii. IMPLEMENTATION DETAILS

Let’s address the footnotes from the previous section (Section 2b, i paragraph 1):

* Within its region of attraction, Newton’s steps are optimal (when we have only second derivative information). Outside of this region, Newton can make very poor decisions and diverge. In general, Newton is more sensitive to its initial conditions than gradient descent, but it has the potential to be much, much faster.

** By quadratic convergence, we mean that once Newton is near enough to the solution, the error is roughly squared each iteration (the magnitude of the log of the error roughly doubles). Numerically, we would see the “number of correct digits” double each iteration. Again, this only happens once Newton is “close enough.”

Newton’s Method is a root-finding technique at its base. To find a root of g(x), Newton requires an initial guess, x_0, and the ability to compute g(x) and g'(x). The idea is to compute the root of the line tangent to g at x_0; call this x_1, and repeat. The core idea is to make repeated linear approximations to g(x) and proceed in a fixed-point-like fashion.

As an optimization method, we are looking for roots of the gradient, f'(x_{opt}) = 0. So we require an initial guess x_0 and the ability to evaluate f'(x) and f''(x) (in higher dimensions, the gradient and Hessian of f). Thus Newton makes repeated linear approximations to f'(x); equivalently, it locally approximates f(x) with a quadratic function, continuing iteration from the optimum of that quadratic. In particular, Newton solves a quadratic program (a purely quadratic objective) in a single iteration.

Mathematically, the update formulas for gradient descent (GD) and Newton are:

GD: \theta_{i+1} = \theta_i + \gamma * \nabla f(\theta_i)
Newton: \theta_{i+1} = \theta_i - H_f^{-1}(\theta_i) * \nabla f(\theta_i)

Note: the sign of the update is flipped because H is negative definite near a maximum. These update schemes are similar. In GD, \gamma is chosen heuristically. There are many ways to proceed, but only so much can be done with just gradient information; moreover, the standard algorithm always proceeds in the direction of the gradient. Newton takes a much more general approach. Instead of a scalar \gamma, the Newton update applies H^{-1} to the gradient, changing both the direction and magnitude of the step.

Unfortunately, Newton indiscriminately finds solutions where f'(x) = 0. This is not necessarily an optimum! In one dimension, we can have f'(x) = 0 and f''(x) = 0, in which case the solution need not be an optimum (e.g., y = x^3 at x = 0). In higher dimensions, a saddle point can also result (e.g., z = x^2 - y^2 at x,y = 0). More generally, we have an optimum if the Hessian is strictly negative or positive definite; a saddle point if the Hessian has both positive and negative eigenvalues; and an indeterminate case if the Hessian is singular.
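
A minimal sketch of the Newton update described above, written with numpy (illustrative only; the library's Newton implementation lives in C++):

import numpy as np

def newton_step(theta, grad_f, hess_f):
    """One Newton iteration for maximization: theta_{i+1} = theta_i - H_f^{-1}(theta_i) * grad f(theta_i)."""
    gradient = grad_f(theta)
    hessian = hess_f(theta)
    # Solve H * delta = grad rather than forming the inverse explicitly.
    delta = np.linalg.solve(hessian, gradient)
    return theta - delta

# For a concave quadratic f(x) = -0.5 * x^T A x + b^T x (gradient b - A x, Hessian -A),
# a single step from any starting point lands on the optimum A^{-1} b.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -0.5])
theta_opt = newton_step(np.zeros(2), lambda x: b - A.dot(x), lambda x: -A)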

2c. MULTISTART OPTIMIZATION

Above, we mentioned that gradient descent (GD), Newton, etc. have a difficult time converging if they are started “too far” from an optima. Even if convergence occurs, it will typically be very slow unless the problem is simple. Worse, in a problem with multiple optima, the methods may converge to the wrong one!

Multistarting the optimizers is one way of mitigating* this issue. Multistart involves starting a run of the specified optimizer (e.g., Newton) from each of a set of initial guesses. Then the best result is reported as the result of the whole procedure. By trying a large number of initial guesses, we potentially reduce the need for good guesses; i.e., hopefully at least one guess will be “near enough” to the global optimum. This functionality is provided in MultistartOptimizer::MultistartOptimize(...).

* As noted below in the MultistartOptimizer::MultistartOptimize() function docs, mitigate is intentional here. Multistarting is NOT GUARANTEED to find global optima. But it can increase the chances of success.

Currently we let the user specify the initial guesses. In practice, this typically means a random sampling of points. We do not (yet) make any effort to, say, sample more heavily from regions where “more stuff is happening,” nor do we apply any other such heuristics.

TODO(GH-165): Improve multistart heuristics.

Finally, MultistartOptimizer::MultistartOptimize() is also used to provide ‘dumb’ search functionality (optimization by just evaluating the objective at numerous points). For sufficiently complex problems, gradient descent, Newton, etc. can have exceptionally poor convergence characteristics or run too slowly. In cases where these more advanced techniques fail, we commonly fall back to ‘dumb’ search.
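
The multistart pattern itself is simple; a minimal sketch (illustrative only, assuming a hypothetical local_optimize(objective, guess) helper) is:

def multistart_maximize(objective, local_optimize, initial_guesses):
    """Run a local optimizer from every initial guess and keep the best result."""
    best_point, best_value = None, float('-inf')
    for guess in initial_guesses:
        candidate = local_optimize(objective, guess)
        value = objective(candidate)
        if value > best_value:
            best_point, best_value = candidate, value
    return best_point, best_value

Passing local_optimize = lambda objective, guess: guess reduces this to the ‘dumb’ search described above.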

class moe.optimal_learning.python.python_version.optimization.COBYLAOptimizer(domain, optimizable, optimization_parameters)[source]

Bases: moe.optimal_learning.python.python_version.optimization._ScipyOptimizerWrapper

Optimizes an objective function over the specified constraints with the COBYLA method.

For more information, visit the scipy docs page and the original paper by Powell: http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.optimize.fmin_cobyla.html http://www.damtp.cam.ac.uk/user/na/NA_papers/NA2007_03.pdf

optimization_parameters_type

alias of COBYLAParameters

class moe.optimal_learning.python.python_version.optimization.COBYLAParameters[source]

Bases: moe.optimal_learning.python.python_version.optimization._BaseCOBYLAParameters

Container to hold parameters that specify the behavior of COBYLA.

Suggested values come from scipy documentation for scipy.optimize.fmin_cobyla: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_cobyla.html

Variables:
  • rhobeg – (float64 > 0.0) reasonable initial changes to the variables (suggest: 1.0)
  • rhoend – (float64 > 0.0) final accuracy in the optimization (not precisely guaranteed), which is a lower bound on the size of the trust region (suggest: 1.0e-4)
  • maxfun – (int > 0) maximum number of objective function calls to make (suggest: 1000)
  • catol – (float64 > 0.0) absolute tolerance for constraint violations (suggest: 2.0e-4)
scipy_kwargs()

Return a new OrderedDict which maps field names to their values
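
For orientation, the suggested values above map onto a direct scipy.optimize.fmin_cobyla call roughly as follows (a hedged sketch; the objective and box constraints are made up, and this is not how COBYLAOptimizer invokes scipy internally):

from scipy.optimize import fmin_cobyla

def objective(x):
    # fmin_cobyla minimizes, so negate if you are maximizing (e.g., EI or log likelihood).
    return (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2

# Constraints are callables that must be >= 0 at feasible points; here, the box [0, 2] x [-1, 1].
constraints = [
    lambda x: x[0], lambda x: 2.0 - x[0],
    lambda x: x[1] + 1.0, lambda x: 1.0 - x[1],
]

x_opt = fmin_cobyla(objective, [0.5, 0.0], constraints,
                    rhobeg=1.0, rhoend=1.0e-4, maxfun=1000, catol=2.0e-4)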

class moe.optimal_learning.python.python_version.optimization.GradientDescentOptimizer(domain, optimizable, optimizer_parameters, num_random_samples=None)[source]

Bases: moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface

Optimizes an objective function over the specified domain with the gradient descent method.

Note

See optimize() docstring for more details.

optimize(**kwargs)[source]

Apply gradient descent to find a locally optimal (maximal here) value of the specified objective function.

Note

comments in this method are copied from the function comments of GradientDescentOptimization() in cpp/gpp_optimization.hpp.

Note

Additional high-level discussion is provided in section 2a) in the header docs of this file.

Basic gradient descent (GD) to optimize objective function f(x):

input: initial_guess

next_point = initial_guess
i = 0;
while (not converged) {
  direction = derivative of f(x) at next_point
  step_scale = compute step_size scaling: pre_mult * (i+1)^(-gamma)

  next_point += step_scale * direction
  ++i
}
if (averaging) {
  next_point = average(previous_points, average_range_start, average_range_end)
}

See GradientDescentParameters docstring or the GD code for more information on averaging.

So it marches along the direction of the largest gradient (steepest ascent, since we are maximizing) for some distance. The distance is a combination of the size of the gradient and the step_scale factor. Here, we use a decaying scale to request progressively smaller step sizes: (i+1)^(-gamma), where i is the iteration number.

We do not allow the step to take next_point out of the domain; if this happens, the update is limited. Thus the solution is guaranteed to lie within the region specified by “domain”; note that this may not be a true optima (i.e., the gradient may be substantially nonzero).

We may also limit very large updates (controlled via max_relative_change). Decreasing this value makes gradient descent (GD) more stable but also slower. For very sensitive problems like hyperparameter optimization, max_relative_change = 0.02 is suggested; for less sensitive problems (e.g., EI, especially analytic), you can use 1.0 (or near).

The constraint implementation (no stepping outside the domain) and the large-update limiting are not “pure” gradient descent approaches. They are heuristics meant to improve the optimizer’s robustness. The constraint implementation in particular may lead to non-convergence, and it also may not find constrained optima that lie exactly on a boundary. That would require a more general handling where we search in a (d-1)-dimensional subspace (i.e., only on the boundary).

Note that we are using an absolute tolerance here, based on the size of the most recent step. The suggested value is 1.0e-7, although this may need to be loosened for problems with ‘difficult’ optima (e.g., the shape is not locally very peaked). Setting the tolerance too high can cause wrong answers—e.g., we stop at a point that is not an optimum but simply a region with a small gradient. Setting the tolerance too low may make convergence impossible; GD could get stuck (bouncing between the same few points), or numerical effects could make it impossible to satisfy the tolerance.

Finally, GD terminates if updates are very small.
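
The loop above is straightforward to sketch in Python; the following is illustrative only (a simple box domain handled by clipping, hypothetical parameter names mirroring GradientDescentParameters), not the library's implementation:

import numpy as np

def gradient_ascent(grad_f, initial_guess, domain_bounds,
                    max_num_steps=500, gamma=0.7, pre_mult=1.0, tolerance=1.0e-7):
    """Maximize f via gradient ascent with step size pre_mult * (i+1)^(-gamma), clipped to a box domain."""
    point = np.asarray(initial_guess, dtype=float)
    lower, upper = domain_bounds[:, 0], domain_bounds[:, 1]
    for i in range(max_num_steps):
        step = pre_mult * (i + 1) ** (-gamma) * grad_f(point)
        next_point = np.clip(point + step, lower, upper)     # crude stand-in for the wall heuristics
        if np.linalg.norm(next_point - point) < tolerance:   # terminate on very small updates
            return next_point
        point = next_point
    return point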

class moe.optimal_learning.python.python_version.optimization.GradientDescentParameters[source]

Bases: moe.optimal_learning.python.python_version.optimization._BaseGradientDescentParameters

Container to hold parameters that specify the behavior of Gradient Descent.

Note

the following comments are copied from cpp_wrappers.optimization.GradientDescentParameters

Iterations

The total number of gradient descent steps is at most num_multistarts * max_num_steps * max_num_restarts. Generally, allowing more iterations leads to a better solution but costs more time.

Averaging

When optimizing stochastic objective functions, it can often be beneficial to average some number of gradient descent steps to obtain the final result (vs just returning the last step). Polyak-Ruppert averaging is a postprocessing step where we replace x_n with: \bar{x} = \frac{1}{n - n_0} \sum_{t=n_0 + 1}^{n} x_t. Setting n_0 = 0 averages all steps; n_0 = n - 1 is equivalent to returning x_n directly. Here, num_steps_averaged is n - n_0 (see the sketch after the list below).

  • num_steps_averaged < 0: averages all steps
  • num_steps_averaged == 0: do not average
  • num_steps_averaged > 0 and <= max_num_steps: average the specified number of steps
  • num_steps_averaged > max_num_steps: average all steps
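
A small sketch of how num_steps_averaged selects the averaging window over the stored GD iterates (one reading of the rules above; not the library's code):

import numpy as np

def polyak_ruppert_average(points, num_steps_averaged):
    """Average the final num_steps_averaged iterates of a GD run (points has shape (num_steps, dim))."""
    num_steps = len(points)
    if num_steps_averaged == 0:
        return points[-1]                      # no averaging: return the final step
    if num_steps_averaged < 0 or num_steps_averaged > num_steps:
        num_steps_averaged = num_steps         # average over all steps
    return np.mean(points[-num_steps_averaged:], axis=0)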

Learning Rate

GD may be implemented using a learning rate: pre_mult * (i+1)^{-\gamma}, where i is the current iteration. Larger gamma causes the GD step size to (artificially) scale down faster. Smaller pre_mult (artificially) shrinks the GD step size. Generally, taking a very large number of small steps leads to the most robustness, but it is very slow.

Tolerances

Larger relative changes are potentially less robust but lead to faster convergence. Large tolerances run faster but may lead to high errors or false convergence (e.g., if the tolerance is 1.0e-3 and the learning rate control forces steps to fall below 1.0e-3 quickly, then GD will quit “successfully” without genuinely converging.)

Variables:
  • max_num_steps – (int > 0) maximum number of gradient descent iterations per restart (suggest: 200-1000)
  • max_num_restarts – (int > 0) maximum number of gradient descent restarts, i.e., the number of times we are allowed to call gradient descent; should be at least 2 (suggest: 10-20)
  • num_steps_averaged – (int) number of steps to use in polyak-ruppert averaging (see above) (suggest: 10-50% of max_num_steps for stochastic problems, 0 otherwise)
  • gamma – (float64 > 0.0) exponent controlling rate of step size decrease (see struct docs or GradientDescentOptimizer) (suggest: 0.5-0.9)
  • pre_mult – (float64 > 0.0) scaling factor for step size (see struct docs or GradientDescentOptimizer) (suggest: 0.1-1.0)
  • max_relative_change – (float64 in [0, 1]) max change allowed per GD iteration (as a relative fraction of current distance to wall) (suggest: 0.5-1.0 for less sensitive problems like EI; 0.02 for more sensitive problems like hyperparameter opt)
  • tolerance – (float64 >= 0.0) stop when the magnitude of the gradient falls below this value, OR when we will not move farther than this tolerance (e.g., at a boundary) (suggest: 1.0e-7)
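
As a construction example (hedged: it assumes the documented fields are accepted as keyword arguments, as in the cpp_wrappers counterpart, and the values are simply drawn from the suggestions above):

from moe.optimal_learning.python.python_version.optimization import GradientDescentParameters

gd_parameters = GradientDescentParameters(
    max_num_steps=400,
    max_num_restarts=10,
    num_steps_averaged=0,        # deterministic objective: no Polyak-Ruppert averaging
    gamma=0.7,
    pre_mult=1.0,
    max_relative_change=0.5,     # a less sensitive problem, e.g., analytic EI
    tolerance=1.0e-7,
)
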
class moe.optimal_learning.python.python_version.optimization.LBFGSBOptimizer(domain, optimizable, optimization_parameters, num_random_samples=None)[source]

Bases: moe.optimal_learning.python.python_version.optimization._ScipyOptimizerWrapper

Optimizes an objective function over the specified domain with the L-BFGS-B method.

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm is a quasi-Newton algorithm for optimization. It can be used for DFO (Derivative-Free Optimization) when the gradient is not available, as is the case for the analytic qEI algorithm.

L-BFGS is a memory efficient version of BFGS, and BFGS-B is a variant that handles simple box constraints. We use L-BFGS-B, which is a combination of the two, and is often the optimization algorithm of choice for these types of problems.

For more information, visit the scipy docs and the wikipedia page on BFGS: http://en.wikipedia.org/wiki/Limited-memory_BFGS http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html

optimization_parameters_type

alias of LBFGSBParameters

class moe.optimal_learning.python.python_version.optimization.LBFGSBParameters[source]

Bases: moe.optimal_learning.python.python_version.optimization._BaseLBFGSBParameters

Container to hold parameters that specify the behavior of L-BFGS-B.

Suggested values come from scipy documentation for scipy.optimize.fmin_l_bfgs_b: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_l_bfgs_b.html

Variables:
  • approx_grad – (bool) if true, BFGS will approximate the gradient
  • max_func_evals – (int > 0) maximum number of objective function calls to make (suggest: 15000)
  • max_metric_correc – (int > 0) maximum number of variable metric corrections used to define the limited-memory matrix (suggest: 10)
  • factr – (float64 > 1.0) 1e12 for low accuracy, 1e7 for moderate accuracy, and 10 for extremely high accuracy (suggest: 1000.0)
  • pgtol – (float64 > 0.0) cutoff for highest component of gradient to be considered a critical point (suggest: 1.0e-5)
  • epsilon – (float64 > 0.0) step size for approximating the gradient (suggest: 1.0e-8)
scipy_kwargs()[source]

Return a dict that can be unpacked as kwargs to scipy.optimize.fmin_l_bfgs_b.

Returns:kwargs for controlling the behavior of fmin_l_bfgs_b
Return type:dict
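
For orientation, these fields correspond to scipy.optimize.fmin_l_bfgs_b arguments; a hedged sketch of a direct scipy call with similar values (the objective is made up, and this is not the exact output of scipy_kwargs()):

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def negated_objective(x):
    # fmin_l_bfgs_b minimizes, so negate when maximizing.
    return np.sum((x - 0.25) ** 2)

x_opt, f_opt, info = fmin_l_bfgs_b(
    negated_objective,
    x0=np.zeros(3),
    approx_grad=True,            # approx_grad
    bounds=[(0.0, 1.0)] * 3,     # simple box constraints
    m=10,                        # max_metric_correc
    factr=1000.0,                # factr
    pgtol=1.0e-5,                # pgtol
    epsilon=1.0e-8,              # epsilon
    maxfun=15000,                # max_func_evals
)
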
class moe.optimal_learning.python.python_version.optimization.MultistartOptimizer(optimizer, num_multistarts)[source]

Bases: moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface

A general class for multistarting any class that implements interfaces.optimization_interface.OptimizerInterface (except itself).

Note

comments copied from MultistartOptimizer in gpp_optimization.hpp.

Use with GradientDescentOptimizer, NewtonOptimizer, etc. is standard practice in nonlinear optimization. In particular, without special properties like convexity, single-start optimizers can converge to local optima. In general, a nonlinear function can have many local optima, so the only way to improve* your chances of finding the global optimum is to start from many different locations.

* Improve is intentional here. In the general case, you are not guaranteed (in finite time) to find the global optimum.

Use with NullOptimizer requires special mention here as it might seem silly. This case reduces to evaluating the objective function at every point of initial_guesses. Through function_values, you can get the objective value at each point of initial_guesses too (e.g., for plotting). So use MultistartOptimizer with NullOptimizer to perform a ‘dumb’ search (e.g., initial_guesses can be obtained from a grid, random sampling, etc.). NullOptimizer allows ‘dumb’ search to use the same code as multistart optimization. ‘Dumb’ search is inaccurate but it never fails, so we often use it as a fall-back when more advanced (e.g., gradient descent) techniques fail. (A usage sketch follows the optimize() documentation below.)

optimize(random_starts=None, **kwargs)[source]

Perform multistart optimization with self.optimizer.

Note

comments copied from MultistartOptimizer::MultistartOptimize in gpp_optimization.hpp.

Performs multistart optimization with the specified Optimizer (instance variable) to optimize the specified OptimizableInterface (objective function) over the specified DomainInterface. Optimizer behavior is controlled by the specified ParameterStruct. See class docs and header docs of this file, section 2c and 3b, iii), for more information.

The method allows you to specify what the current best is, so that if optimization cannot beat it, no improvement will be reported. It will otherwise report the overall best improvement (through io_container) as well as the result of every individual multistart run if desired (through function_values).

Parameters:random_starts (array of float64 with shape (num_points, dim) or None) – points from which to multistart self.optimizer; if None, points are chosen randomly
Returns:(best point found, objective function values at the end of each optimization run)
Return type:tuple: (array of float64 with shape (self.optimizer.dim), array of float64 with shape (self.num_multistarts))
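
As a usage sketch of the ‘dumb’ search described above (hedged: domain and objective are placeholders for existing DomainInterface and OptimizableInterface instances, so this is not runnable as-is):

from moe.optimal_learning.python.python_version.optimization import (
    MultistartOptimizer,
    NullOptimizer,
)

null_optimizer = NullOptimizer(domain, objective)                 # placeholders: domain, objective
multistart_optimizer = MultistartOptimizer(null_optimizer, 1000)  # num_multistarts = 1000

# 'Dumb' search: evaluate the objective at 1000 randomly chosen points and keep the best.
best_point, function_values = multistart_optimizer.optimize()
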
class moe.optimal_learning.python.python_version.optimization.NewtonParameters[source]

Bases: moe.optimal_learning.python.python_version.optimization._BaseNewtonParameters

See docstring at moe.optimal_learning.python.cpp_wrappers.optimization.NewtonParameters.

class moe.optimal_learning.python.python_version.optimization.NullOptimizer(domain, optimizable, *args, **kwargs)[source]

Bases: moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface

A “null” or identity optimizer: this does nothing. It is used to perform “dumb” search with MultistartOptimizer.

optimize(*args, **kwargs)[source]

Do nothing; arguments are unused.

class moe.optimal_learning.python.python_version.optimization.NullParameters[source]

Bases: moe.optimal_learning.python.python_version.optimization.NullParameters

Empty container for optimizers that do not require any parameters (e.g., the null optimizer).

moe.optimal_learning.python.python_version.optimization.multistart_optimize(optimizer, starting_points=None, num_multistarts=None)[source]

Multistart the specified optimizer randomly or from the specified list of initial guesses.

If starting_points is specified, this will always multistart from those points. If starting_points is not specified and num_multistarts is specified, we start from a random set of points. If starting_points is not specified and num_multistarts is not specified, an exception is raised.

This is a simple wrapper around MultistartOptimizer.

Parameters:
  • optimizer (interfaces.optimization_interface.OptimizerInterface subclass) – object that will perform the optimization
  • starting_points (array of float64 with shape (num_points, evaluator.problem_size)) – points at which to initialize optimization runs
Returns:

(best point found, objective function values at the end of each optimization run)

Return type:

tuple: (array of float64 with shape (optimizer.dim), array of float64 with shape (starting_points.shape[0]) or (num_multistarts))

Raises:

ValueError: if both starting_points and num_multistarts are None

moe.optimal_learning.python.python_version.python_utils module

Utilities for computing covariance matrices and related structures.

moe.optimal_learning.python.python_version.python_utils.build_covariance_matrix(covariance, points_sampled, noise_variance=None)[source]

Compute the covariance matrix, K, of a list of points, X_i.

Note

These comments are copied from BuildCovarianceMatrix() in gpp_math.cpp.

Matrix is computed as: A_{i,j} = covariance(X_i, X_j) + \delta_{i,j} * noise_i, where \delta_{i,j} is the Kronecker delta, equal to 1 if i == j and 0 otherwise. The result is SPD assuming the covariance operator is SPD and the points are unique.

Generally, this is called from other functions with “points_sampled” as the input and not any arbitrary list of points; hence the very specific input name.

Point list cannot contain duplicates. Doing so (or providing nearly duplicate points) can lead to semi-definite matrices or very poor numerical conditioning.

Parameters:
  • covariance (interfaces.covariance_interface.CovarianceInterface subclass) – the covariance function encoding assumptions about the GP’s behavior on our data
  • points_sampled (array of float64 with shape (points_sampled.shape[0], dim)) – points, X_i
  • noise_variance (array of float64 with shape (points_sampled.shape[0])) – the i-th entry is the amount of noise variance to add to the i-th diagonal entry; i.e., the noise from measuring the i-th point
Returns:

covariance matrix

Return type:

array of float64 with shape (points_sampled.shape[0], points_sampled.shape[0]), order=’F’

Note

Fortran ordering is important here; scipy.linalg factor/solve methods (e.g., cholesky, solve_triangular) implicitly require order=’F’ to enable overwriting. This output is commonly overwritten.
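
A minimal usage sketch (the hyperparameter and point values are arbitrary; SquareExponential expects [\alpha, length_1, ..., length_dim]):

import numpy as np
from moe.optimal_learning.python.python_version.covariance import SquareExponential
from moe.optimal_learning.python.python_version.python_utils import build_covariance_matrix

covariance = SquareExponential(np.array([1.0, 0.4, 0.7]))          # dim = 2
points_sampled = np.array([[0.0, 0.0], [0.5, 0.2], [0.9, 0.8]])    # distinct points
noise_variance = np.full(points_sampled.shape[0], 1.0e-8)

kernel_matrix = build_covariance_matrix(covariance, points_sampled, noise_variance=noise_variance)
# kernel_matrix has shape (3, 3), is Fortran-ordered, and is SPD since the points are distinct.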

moe.optimal_learning.python.python_version.python_utils.build_hyperparameter_grad_covariance_matrix(covariance, points_sampled)[source]

Build A_{jik} = \pderiv{K_{ij}}{\theta_k}.

Note

These comments are copied from BuildHyperparameterGradCovarianceMatrix() in gpp_model_selection.cpp.

Build A_{jik} = \pderiv{K_{ij}}{\theta_k}. Hence the outer loop structure is identical to BuildCovarianceMatrix().

Note the structure of the resulting tensor is num_hyperparameters blocks of size num_sampled X num_sampled. Consumers of this want dK/d\theta_k located sequentially. However, for a given pair of points (x, y), it is more efficient to compute all hyperparameter derivatives at once. Thus the innermost loop writes to all num_hyperparameters blocks at once.

Consumers of this result generally require complete storage (i.e., will not take advantage of its symmetry), so instead of ignoring the upper triangles, we copy them from the (already-computed) lower triangles to avoid redundant work.

Since CovarianceInterface.HyperparameterGradCovariance() returns a vector of size |\theta_k|, the inner loop writes all relevant entries of A_{jik} simultaneously to prevent recomputation.

Parameters:
  • covariance (interfaces.covariance_interface.CovarianceInterface subclass) – the covariance function encoding assumptions about the GP’s behavior on our data
  • points_sampled (array of float64 with shape (points_sampled.shape[0], dim)) – points, X_i
Returns:

gradient of covariance matrix wrt hyperparameters

Return type:

array of float64 with shape (points_sampled.shape[0], points_sampled.shape[0], num_hyperparameters), order=’F’

Note

Fortran ordering is important here; scipy.linalg factor/solve methods (e.g., cholesky, solve_triangular) implicitly require order=’F’ to enable overwriting. This output is commonly overwritten.

moe.optimal_learning.python.python_version.python_utils.build_mix_covariance_matrix(covariance, points_sampled, points_to_sample)[source]

Compute the “mix” covariance matrix, Ks, of Xs and X (points_to_sample and points_sampled, respectively).

Note

These comments are copied from BuildMixCovarianceMatrix() in gpp_math.cpp.

Matrix is computed as: A_{i,j} = covariance(X_i, Xs_j). Result is not guaranteed to be SPD and need not even be square.

Generally, this is called from other functions with “points_sampled” and “points_to_sample” as the input lists and not any arbitrary list of points; hence the very specific input name. But this is not a requirement.

Point lists cannot contain duplicates with each other or within themselves.

Parameters:
  • covariance (interfaces.covariance_interface.CovarianceInterface subclass) – the covariance function encoding assumptions about the GP’s behavior on our data
  • points_sampled (array of float64 with shape (points_sampled.shape[0], dim)) – points, X_i
  • points_to_sample (array of float64 with shape (points_to_sample.shape[0], dim)) – points, Xs_i
Returns:

“mix” covariance matrix

Return type:

array of float64 with shape (points_sampled.shape[0], points_to_sample.shape[0]), order=’F’

Note

Fortran ordering is important here; scipy.linalg factor/solve methods (e.g., cholesky, solve_triangular) implicitly require order=’F’ to enable overwriting. This output is commonly overwritten.
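
A hedged sketch of how the “mix” matrix combines with the covariance matrix, using the textbook GP posterior mean Ks^T K^{-1} y (illustrative only; the library's GaussianProcess does this for you, and the data here are arbitrary):

import numpy as np
import scipy.linalg
from moe.optimal_learning.python.python_version.covariance import SquareExponential
from moe.optimal_learning.python.python_version.python_utils import (
    build_covariance_matrix,
    build_mix_covariance_matrix,
)

covariance = SquareExponential(np.array([1.0, 0.5]))                # dim = 1
points_sampled = np.array([[0.0], [0.4], [1.0]])
values_sampled = np.array([0.1, -0.3, 0.2])
points_to_sample = np.array([[0.2], [0.8]])

K = build_covariance_matrix(covariance, points_sampled, noise_variance=np.full(3, 1.0e-8))
Ks = build_mix_covariance_matrix(covariance, points_sampled, points_to_sample)  # shape (3, 2)

# Cholesky solve exploits the Fortran ordering noted above (K may be overwritten here).
chol = scipy.linalg.cho_factor(K, lower=True, overwrite_a=True)
posterior_mean = Ks.T.dot(scipy.linalg.cho_solve(chol, values_sampled))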

Module contents

Implementations of the ABCs in the moe.optimal_learning.python.interfaces package using Python (as opposed to C++ calls).

The modules in this package are meant to fill two main purposes:

  1. Provide a (hopefully) easy-to-read Python implementation to help familiarize people (especially those who are not well-versed in C++) with the features of the optimal_learning library.
  2. Provide a convenient work/test environment for developers to try out new algorithms, features, and ideas. If you have a new way to compute Expected Improvement, you can quickly develop the algorithm in Python and then use either moe.optimal_learning.python.python_version.gaussian_process or (faster) moe.optimal_learning.python.cpp_wrappers.gaussian_process. Or you can test out some new optimization methods in Python and connect your optimizers to the objective functions in this package or offload the expensive computation to C++ via cpp_wrappers.

Unlike the interface implementations in the cpp_wrappers package, the classes in this package are all composable with each other and with classes in cpp_wrappers.

Modules in this package make extensive use of numpy and scipy. As mentioned in the interface documentation, all functions return and accept numpy arrays.

See the package comments for moe.optimal_learning.python.interfaces for an overview of optimal_learning’s capabilities.