moe.optimal_learning.python.interfaces package¶
Submodules¶
moe.optimal_learning.python.interfaces.covariance_interface module¶
Interface for covariance function: covariance of two points and spatial/hyperparameter derivatives.
Note
comments are copied from the file comments of gpp_covariance.hpp
Covariance functions have a few fundamental properties (see references at the bottom for full details). In short, they are SPSD (symmetric positive semi-definite): k(x,x') = k(x', x) for any x,x' and k(x,x) >= 0 for all x. As a consequence, covariance matrices are SPD as long as the input points are all distinct.
Additionally, the Square Exponential and Matern covariances (as well as other functions) are stationary. In essence, this means they can be written as k(r) = k(|x - x'|) = k(x, x') = k(x', x). So they operate on distances between points as opposed to the points themselves. The name stationary arises because the covariance is the same modulo linear shifts: k(x+a, x'+a) = k(x, x').
Covariance functions are a fundamental component of gaussian processes: as noted in the gpp_math.hpp header comments, gaussian processes are defined by a mean function and a covariance function. Covariance functions describe how two random variables change in relation to each other–more explicitly, in a GP they specify how similar two points are. The choice of covariance function is important because it encodes our assumptions about how the “world” behaves.
Covariance functions also generally have hyperparameters (e.g., signal/background noise, length scales) that specify the assumed behavior of the Gaussian Process. Specifying hyperparameters is tricky because changing them fundamentally changes the behavior of the GP. moe.optimal_learning.python.interfaces.optimization_interface together with moe.optimal_learning.python.interfaces.log_likelihood_interface provide methods for optimizing and evaluating model fit, respectively.
- class moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface[source]¶
Bases: object
Interface for a covariance function: covariance of two points and spatial/hyperparameter derivatives.
Note
comments are copied from the class comments of CovarianceInterface in gpp_covariance.hpp
Abstract class to enable evaluation of covariance functions–supports the evaluation of the covariance between two points, as well as the gradient with respect to those coordinates and gradient/hessian with respect to the hyperparameters of the covariance function.
Covariance operators, cov(x_1, x_2), are SPD. Due to the symmetry, there is no need to differentiate wrt x_1 and x_2; hence the gradient operation should only take gradients wrt dim variables, where dim = |x_1|
Hyperparameters (denoted \theta_j) are stored as class member data by subclasses.
Implementers of this ABC are required to manage their own hyperparameters.
TODO(GH-71): getter/setter for hyperparameters.
- covariance(point_one, point_two)[source]¶
Compute the covariance function of two points, cov(point_one, point_two).
Note
comments are copied from the matching method comments of CovarianceInterface in gpp_covariance.hpp and comments are copied to the matching method comments of moe.optimal_learning.python.python_version.covariance.SquareExponential.
The covariance function is guaranteed to be symmetric by definition: covariance(x, y) = covariance(y, x). This function is also positive definite by definition.
Parameters: - point_one (array of float64 with shape (dim)) – first input, the point x
- point_two (array of float64 with shape (dim)) – second input, the point y
Returns: value of covariance between the input points
Return type: float64
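For intuition, here is a minimal sketch of this method for the Square Exponential covariance described in the log_likelihood_interface module docs, using the hyperparameter layout [\alpha, L_1, ..., L_d] assumed there (the function name is illustrative, not part of this interface):

    import numpy


    def square_exponential_covariance(point_one, point_two, hyperparameters):
        """Sketch of cov(x, y) = alpha * exp(-0.5 * sum_i (x_i - y_i)^2 / L_i^2).

        hyperparameters[0] is alpha (the signal variance); hyperparameters[1:]
        are the per-dimension length scales L_i.
        """
        alpha = hyperparameters[0]
        length_scales = numpy.asarray(hyperparameters[1:])
        difference = numpy.asarray(point_one) - numpy.asarray(point_two)
        return alpha * numpy.exp(-0.5 * numpy.sum((difference / length_scales) ** 2))

Since the result depends only on squared coordinate differences, symmetry (and stationarity) hold by construction.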
- get_hyperparameters()[source]¶
Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.
- grad_covariance(point_one, point_two)[source]¶
Compute the gradient of self.covariance(point_one, point_two) with respect to the FIRST argument, point_one.
Note
comments are copied from the matching method comments of CovarianceInterface in gpp_covariance.hpp and comments are copied to the matching method comments of moe.optimal_learning.python.python_version.covariance.SquareExponential.
This distinction is important for maintaining the desired symmetry. Cov(x, y) = Cov(y, x). Additionally, \pderiv{Cov(x, y)}{x} = \pderiv{Cov(y, x)}{x}. However, in general, \pderiv{Cov(x, y)}{x} != \pderiv{Cov(y, x)}{y} (NOT equal! These may differ by a negative sign)
Hence to avoid separate implementations for differentiating against first vs second argument, this function only handles differentiation against the first argument. If you need \pderiv{Cov(y, x)}{x}, just swap points x and y.
Parameters: - point_one (array of float64 with shape (dim)) – first input, the point x
- point_two (array of float64 with shape (dim)) – second input, the point y
Returns: grad_cov: i-th entry is \pderiv{cov(x_1, x_2)}{x_i}
Return type: array of float64 with shape (dim)
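Continuing the Square Exponential sketch above (illustrative, not part of this interface), the gradient wrt point_one follows from d cov / d x_i = cov(x, y) * (y_i - x_i) / L_i^2:

    def square_exponential_grad_covariance(point_one, point_two, hyperparameters):
        """Gradient of the Square Exponential sketch wrt point_one.

        Swapping the argument order (equivalently, negating the result)
        yields \pderiv{Cov(y, x)}{x}, as described above.
        """
        length_scales = numpy.asarray(hyperparameters[1:])
        difference = numpy.asarray(point_two) - numpy.asarray(point_one)
        cov = square_exponential_covariance(point_one, point_two, hyperparameters)
        return cov * difference / length_scales ** 2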
- hyperparameter_grad_covariance(point_one, point_two)[source]¶
Compute the gradient of self.covariance(point_one, point_two) with respect to its hyperparameters.
Note
comments are copied from the matching method comments of CovarianceInterface in gpp_covariance.hpp and comments are copied to the matching method comments of moe.optimal_learning.python.python_version.covariance.SquareExponential.
Unlike GradCovariance(), the order of point_one and point_two is irrelevant here (since we are not differentiating against either of them). Thus the matrix of grad covariances (wrt hyperparameters) is symmetric.
Parameters: - point_one (array of float64 with shape (dim)) – first input, the point x
- point_two (array of float64 with shape (dim)) – second input, the point y
Returns: grad_hyperparameter_cov: i-th entry is \pderiv{cov(x_1, x_2)}{\theta_i}
Return type: array of float64 with shape (num_hyperparameters)
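For the same Square Exponential sketch, differentiating against the hyperparameters [\alpha, L_1, ..., L_d] gives d cov / d \alpha = cov / \alpha and d cov / d L_i = cov * (x_i - y_i)^2 / L_i^3:

    def square_exponential_hyperparameter_grad_covariance(point_one, point_two, hyperparameters):
        """Gradient of the Square Exponential sketch wrt its hyperparameters."""
        alpha = hyperparameters[0]
        length_scales = numpy.asarray(hyperparameters[1:])
        difference = numpy.asarray(point_one) - numpy.asarray(point_two)
        cov = square_exponential_covariance(point_one, point_two, hyperparameters)
        grad = numpy.empty(len(hyperparameters))
        grad[0] = cov / alpha
        grad[1:] = cov * difference ** 2 / length_scales ** 3
        return grad

Note the result is unchanged under swapping point_one and point_two, matching the symmetry claim above.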
- hyperparameter_hessian_covariance(point_one, point_two)[source]¶
Compute the hessian of self.covariance(point_one, point_two) with respect to its hyperparameters.
Note
comments are copied from the matching method comments of CovarianceInterface in gpp_covariance.hpp
The Hessian matrix of the covariance evaluated at x_1, x_2 with respect to the hyperparameters. The Hessian is defined as:

[ \ppderiv{cov}{\theta_0^2}               \mixpderiv{cov}{\theta_0}{\theta_1}     ... \mixpderiv{cov}{\theta_0}{\theta_{n-1}} ]
[ \mixpderiv{cov}{\theta_1}{\theta_0}     \ppderiv{cov}{\theta_1^2}               ... \mixpderiv{cov}{\theta_1}{\theta_{n-1}} ]
[ ...                                     ...                                         ...                                     ]
[ \mixpderiv{cov}{\theta_{n-1}}{\theta_0} \mixpderiv{cov}{\theta_{n-1}}{\theta_1} ... \ppderiv{cov}{\theta_{n-1}^2}           ]

where "cov" abbreviates covariance(x_1, x_2) and "n" refers to the number of hyperparameters.
Unless noted otherwise in subclasses, the Hessian is symmetric (due to the equality of mixed derivatives when a function f is twice continuously differentiable).
Similarly to the gradients, the Hessian is independent of the order of x_1, x_2: H_{cov}(x_1, x_2) = H_{cov}(x_2, x_1)
For further details: http://en.wikipedia.org/wiki/Hessian_matrix
Parameters: - point_one (array of float64 with shape(dim)) – first input, the point x
- point_two (array of float64 with shape (dim)) – second input, the point y
Returns: hessian_hyperparameter_cov: (i,j)-th entry is \mixpderiv{cov(x_1, x_2)}{\theta_i}{\theta_j}
Return type: array of float64 with shape (num_hyperparameters, num_hyperparameters)
- hyperparameters¶
Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.
moe.optimal_learning.python.interfaces.domain_interface module¶
Interface for a domain: in/out test, random point generation, and update limiting (for constrained optimization).
- class moe.optimal_learning.python.interfaces.domain_interface.DomainInterface[source]¶
Bases: object
Interface for a domain: in/out test, random point generation, and update limiting (for constrained optimization).
- check_point_inside(point)[source]¶
Check if a point is inside the domain/on its boundary or outside.
Parameters: point (array of float64 with shape (dim)) – point to check Returns: true if point is inside the domain Return type: bool
- compute_update_restricted_to_domain(max_relative_change, current_point, update_vector)[source]¶
Compute a new update so that CheckPointInside(current_point + new_update) is true.
Changes new_update_vector so that:

    point_new = point + new_update_vector

has coordinates such that CheckPointInside(point_new) returns true.

new_update_vector is a function of update_vector; it is just a copy of update_vector if current_point is already inside the domain.
Note
We modify update_vector (instead of returning point_new) so that further update limiting/testing may be performed.
Parameters: - max_relative_change (float64 in (0, 1]) – max change allowed per update (as a relative fraction of current distance to boundary)
- current_point (array of float64 with shape (dim)) – starting point
- update_vector (array of float64 with shape (dim)) – proposed update
Returns: new update so that the final point remains inside the domain
Return type: array of float64 with shape (dim)
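For a simple tensor product (box) domain, a sketch of this update limiting might look like the following; the explicit lower_bounds/upper_bounds parameters are assumptions for illustration, not part of this interface:

    import numpy


    def compute_update_restricted_to_box(max_relative_change, current_point,
                                         update_vector, lower_bounds, upper_bounds):
        """Clamp each coordinate of update_vector so that the new point stays in
        the box and no coordinate moves more than max_relative_change of its
        current distance to the boundary in the direction of travel."""
        new_update = numpy.array(update_vector, dtype=float)
        for d in range(len(current_point)):
            # Distance to the boundary in the direction this coordinate moves.
            if new_update[d] > 0.0:
                distance_to_boundary = upper_bounds[d] - current_point[d]
            else:
                distance_to_boundary = current_point[d] - lower_bounds[d]
            max_step = max_relative_change * distance_to_boundary
            if abs(new_update[d]) > max_step:
                new_update[d] = numpy.copysign(max_step, new_update[d])
        return new_update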
- generate_random_point_in_domain(random_source=None)[source]¶
Generate point uniformly at random such that self.check_point_inside(point) is True.
Note
if you need multiple points, use generate_uniform_random_points_in_domain instead; depending on the implementation, it may yield better distributions over many points. For example, tensor product type domains use latin hypercube sampling instead of repeated random draws, which guarantees that no non-uniform clusters arise (in subspaces), whereas this method treats all draws independently.
Returns: point in domain Return type: array of float64 with shape (dim)
- generate_uniform_random_points_in_domain(num_points, random_source)[source]¶
Generate AT MOST num_points uniformly distributed points from the domain.
Note
The number of points returned may be LESS THAN num_points!
Implementations may use rejection sampling. In such cases, generating the requested number of points may be unreasonably slow, so implementers are allowed to generate fewer than num_points results.
Parameters: - num_points (int >= 0) – max number of points to generate
- random_source (callable yielding uniform random numbers in [0,1]) –
Returns: uniform random sampling of points from the domain; may be fewer than num_points!
Return type: array of float64 with shape (num_points_generated, dim)
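A sketch of the rejection sampling approach mentioned above; bounding_box_sampler is a hypothetical callable returning an (n, dim) array of uniform draws from a region enclosing the domain:

    import numpy


    def generate_points_by_rejection(domain, num_points, bounding_box_sampler):
        """Keep uniform candidate draws that land inside the domain; may return
        fewer than num_points, so callers must check the result's shape."""
        max_attempts = 10  # bound the total work even if the acceptance rate is low
        accepted = []
        for _ in range(max_attempts):
            candidates = bounding_box_sampler(num_points)
            accepted.extend(p for p in candidates if domain.check_point_inside(p))
            if len(accepted) >= num_points:
                break
        return numpy.array(accepted[:num_points])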
moe.optimal_learning.python.interfaces.expected_improvement_interface module¶
Interface for computation of the Expected Improvement at points sampled from a GaussianProcess.
Note
These comments were copied from the file comments in gpp_math.cpp.
See the package docs (moe.optimal_learning.python.interfaces) for the basics of expected improvement and the definition of the q,p-EI problem.
Then the improvement for this single sample is:

I = { best_known - min(y)   if (best_known - min(y) > 0)          (Equation 5)
    { 0                     else

where y is a particular prediction from the underlying Gaussian Process and best_known is the best observed value (min) so far.
And the expected improvement, EI, can be computed by averaging repeated computations of I; i.e., monte-carlo integration. This is done in moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface.compute_expected_improvement; we can also compute the gradient. This computation is needed in the optimization of q,p-EI.
There is also a special, analytic case of EI computation that does not require monte-carlo integration. This special case can only be used to compute 1,0-EI (and its gradient). Still this can be very useful (e.g., the heuristic optimization in gpp_heuristic_expected_improvement_optimization.hpp estimates q,0-EI by repeatedly solving 1,0-EI).
From there, since EI is taken from a sum of gaussians, we expect it to be reasonably smooth and apply multistart, restarted gradient descent to find the optimum. The use of gradient descent implies the need for all of the various “grad” functions, e.g., gaussian_process.compute_grad_mean_of_points(). This is handled by coupling an implementation of moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface to an optimizer (moe.optimal_learning.python.interfaces.optimization_interface).
- class moe.optimal_learning.python.interfaces.expected_improvement_interface.ExpectedImprovementInterface[source]¶
Bases: object
Interface for Expected Improvement computation: EI and its gradient at specified point(s) sampled from a GaussianProcess.
A class to encapsulate the computation of expected improvement and its spatial gradient using points sampled from an associated GaussianProcess. The general EI computation requires monte-carlo integration; it can support q,p-EI optimization. It is designed to work with any GaussianProcess.
See file docs for a description of what EI is and an overview of how it can be computed.
Implementers are responsible for dealing with PRNG state for any randomness needed in EI computation. Implementers are also responsible for storing points_to_sample and points_being_sampled:
Parameters: - points_to_sample (array of float64 with shape (num_to_sample, dim)) – points at which to evaluate EI and/or its gradient to check their value in future experiments (i.e., “q” in q,p-EI)
- points_being_sampled (array of float64 with shape (num_being_sampled, dim)) – points being sampled in concurrent experiments (i.e., “p” in q,p-EI)
- compute_expected_improvement(**kwargs)[source]¶
Compute the expected improvement at points_to_sample, with points_being_sampled concurrent points being sampled.
Note
These comments were copied from ExpectedImprovementEvaluator::ComputeExpectedImprovement in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.expected_improvement.ExpectedImprovement.compute_expected_improvement() and moe.optimal_learning.python.python_version.expected_improvement.ExpectedImprovement.compute_expected_improvement()
points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.
Computes the expected improvement EI(Xs) = E_n[[f^*_n(X) - min(f(Xs_1),...,f(Xs_m))]^+], where Xs are potential points to sample (union of points_to_sample and points_being_sampled) and X are already sampled points. The ^+ indicates that the expression in the expectation evaluates to 0 if it is negative. f^*(X) is the MINIMUM over all known function evaluations (points_sampled_value), whereas f(Xs) are GP-predicted function evaluations.
In words, we are computing the expected improvement (over the current best_so_far, best known objective function value) that would result from sampling (aka running new experiments) at points_to_sample with points_being_sampled concurrent/ongoing experiments.
In general, the EI expression is complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.
The idea of the MC approach is to repeatedly sample at the union of points_to_sample and points_being_sampled. This is analogous to gaussian_process_interface.sample_point_from_gp, but we sample num_union points at once: y = \mu + Lw where \mu is the GP-mean, L is the chol_factor(GP-variance) and w is a vector of num_union draws from N(0, 1). Then: improvement_per_step = max(max(best_so_far - y), 0.0) Observe that the inner max means only the smallest component of y contributes in each iteration. We compute the improvement over many random draws and average.
Returns: the expected improvement from sampling points_to_sample with points_being_sampled concurrent experiments Return type: float64
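A sketch of the Monte-Carlo scheme just described, assuming the caller has already formed \mu (GP mean) and L (Cholesky factor of the GP variance) at the union of points_to_sample and points_being_sampled:

    import numpy


    def monte_carlo_expected_improvement(mu, chol_var, best_so_far,
                                         num_mc_iterations=10000):
        """Average improvement_per_step = max(max(best_so_far - y), 0.0) over
        repeated draws y = mu + L * w, with w ~ N(0, I)."""
        num_union = mu.shape[0]
        total_improvement = 0.0
        for _ in range(num_mc_iterations):
            normals = numpy.random.normal(size=num_union)
            y = mu + numpy.dot(chol_var, normals)
            total_improvement += max(numpy.amax(best_so_far - y), 0.0)
        return total_improvement / num_mc_iterations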
- compute_grad_expected_improvement(**kwargs)[source]¶
Compute the gradient of expected improvement at points_to_sample wrt points_to_sample, with points_being_sampled concurrent samples.
Note
These comments were copied from ExpectedImprovementEvaluator::ComputeGradExpectedImprovement in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.expected_improvement.ExpectedImprovement.compute_grad_expected_improvement() and moe.optimal_learning.python.python_version.expected_improvement.ExpectedImprovement.compute_grad_expected_improvement()
points_to_sample is the “q” and points_being_sampled is the “p” in q,p-EI.
In general, the expressions for gradients of EI are complex and difficult to evaluate; hence we use Monte-Carlo simulation to approximate it. When faster (e.g., analytic) techniques are available, we will prefer them.
The MC computation of grad EI is similar to the computation of EI (described in compute_expected_improvement). We differentiate y = \mu + Lw wrt points_to_sample; only terms from the gradient of \mu and L contribute. In EI, we computed: improvement_per_step = max(max(best_so_far - y), 0.0) and noted that only the smallest component of y may contribute (if it is > 0.0). Call this index winner. Thus in computing grad EI, we only add gradient terms that are attributable to the winner-th component of y.
Returns: gradient of EI, \pderiv{EI(Xq \cup Xp)}{Xq_{i,d}} where Xq is points_to_sample and Xp is points_being_sampled (grad EI from sampling points_to_sample with points_being_sampled concurrent experiments wrt each dimension of the points in points_to_sample) Return type: array of float64 with shape (num_to_sample, dim)
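A sketch of the winner accounting described above, with assumed shapes (num_union = num_to_sample + num_being_sampled): mu is (num_union,), chol_var is (num_union, num_union), grad_mu is (num_to_sample, dim) in the reduced form described in gaussian_process_interface, and grad_chol is (num_to_sample, num_union, num_union, dim):

    import numpy


    def monte_carlo_grad_expected_improvement(mu, chol_var, grad_mu, grad_chol,
                                              best_so_far, num_to_sample,
                                              num_mc_iterations=10000):
        """Accumulate gradient terms only from the winner-th component of y."""
        dim = grad_mu.shape[1]
        num_union = mu.shape[0]
        grad_ei = numpy.zeros((num_to_sample, dim))
        for _ in range(num_mc_iterations):
            normals = numpy.random.normal(size=num_union)
            y = mu + numpy.dot(chol_var, normals)
            winner = numpy.argmin(y)  # only the smallest component can contribute
            if best_so_far - y[winner] > 0.0:
                for k in range(num_to_sample):
                    # d(best_so_far - y_winner)/dXq_k = -(d mu_winner + (dL * w)_winner)
                    if k == winner:
                        grad_ei[k] -= grad_mu[k]  # reduced form: zero when k != winner
                    grad_ei[k] -= numpy.dot(normals, grad_chol[k][winner])
        return grad_ei / num_mc_iterations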
moe.optimal_learning.python.interfaces.gaussian_process_interface module¶
Interface for a GaussianProcess: mean, variance, gradients thereof, and data I/O.
This file contains two classes, moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessDataInterface and moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessInterface. They specify the interface that a GaussianProcess implementation must satisfy in order to be used in computation/optimization of ExpectedImprovement, etc. Python currently does not natively support interfaces, so we are commandeering ABCs for that purpose.
See package docs in moe.optimal_learning.python.interfaces for an introduction to Gaussian Processes.
- class moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessDataInterface[source]¶
Bases: object
Core data interface for constructing or manipulating a Gaussian Process Prior (GPP).
This interface exists as a convenience to safely access the fundamental components of a GPP.
Assumes the underlying GP has mean zero.
Includes functions to return copies of the covariance function (see CovarianceInterface) and observed, historical data (coordinates, function values, noise variance; see moe.optimal_learning.python.data_containers.HistoricalData) of a GP object or an object supporting computations on GPs.
With the zero mean assumption, a “Gaussian Process” is fully determined by its covariance function (Rasmussen & Williams, Chp 2.2). Then the “Prior” is fully determined by our past observations. Together, the covariance and historical data produce the heart of a Gaussian Process Prior.
- get_core_data_copy()[source]¶
Tuple of covariance, historical_data copies for convenience; see specific getters.
- get_covariance_copy()[source]¶
Return a copy of the covariance object specifying the Gaussian Process.
Returns: covariance object encoding assumptions about the GP’s behavior on our data Return type: moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface subclass
- get_historical_data_copy()[source]¶
Return the data (points, function values, noise) specifying the prior of the Gaussian Process.
Returns: object specifying the already-sampled points, the objective value at those points, and the noise variance associated with each observation Return type: moe.optimal_learning.python.data_containers.HistoricalData
- class moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessInterface[source]¶
Bases: moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessDataInterface
Interface for a GaussianProcess: mean, variance, gradients thereof, and data I/O.
Note
comments in this class are copied from GaussianProcess in gpp_math.hpp and duplicated in cpp_wrappers.gaussian_process and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess and moe.optimal_learning.python.python_version.gaussian_process.GaussianProcess
Object that encapsulates Gaussian Process Priors (GPPs). A GPP is defined by a set of (sample point, function value, noise variance) triples along with a covariance function that relates the points. Each point has dimension dim. These are the training data; for example, each sample point might specify an experimental cohort and the corresponding function value is the objective measured for that experiment. There is one noise variance value per function value; this is the measurement error and is treated as N(0, noise_variance) Gaussian noise.
GPPs estimate a real process f(x) ~ GP(m(x), k(x,x')) (see file docs). This class deals with building an estimator to the actual process using measurements taken from the actual process: the (sample point, function value, noise) triple. Then predictions about unknown points can be made by sampling from the GPP; in particular, finding the (predicted) mean and variance. These functions (and their gradients) are provided in ComputeMeanOfPoints, ComputeVarianceOfPoints, etc.
Further mathematical details are given in the implementation comments, but we are essentially computing:
ComputeMeanOfPoints    : K(Xs, X) * [K(X,X) + \sigma_n^2 I]^{-1} * y = Ks^T * K^{-1} * y
ComputeVarianceOfPoints: K(Xs, Xs) - K(Xs, X) * [K(X,X) + \sigma_n^2 I]^{-1} * K(X, Xs) = Kss - Ks^T * K^{-1} * Ks

This (estimated) mean and variance characterize the predicted distributions of the actual m(x), k(x,x') functions that underlie our GP.
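A sketch of these two formulas in numpy/scipy, replacing the explicit inverse with a Cholesky solve (an assumption about implementation strategy, not a statement about this library's internals):

    import numpy
    import scipy.linalg


    def gp_posterior_mean_and_variance(K, Ks, Kss, y, noise_variance):
        """K: (num_sampled, num_sampled); Ks: (num_sampled, num_to_sample);
        Kss: (num_to_sample, num_to_sample); y, noise_variance: (num_sampled,)."""
        chol = scipy.linalg.cho_factor(K + numpy.diag(noise_variance), lower=True)
        # mean = Ks^T * [K + sigma_n^2 I]^{-1} * y
        mean = numpy.dot(Ks.T, scipy.linalg.cho_solve(chol, y))
        # var = Kss - Ks^T * [K + sigma_n^2 I]^{-1} * Ks
        var = Kss - numpy.dot(Ks.T, scipy.linalg.cho_solve(chol, Ks))
        return mean, var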
The “independent variables” for this object are points_to_sample. These points are both the “p” and the “q” in q,p-EI; i.e., they are the parameters of both ongoing experiments and new predictions. Recall that in q,p-EI, the q points are called points_to_sample and the p points are called points_being_sampled. Here, we need to make predictions about both point sets with the GP, so we simply call the union of point sets points_to_sample.
In GP computations, there is really no distinction between the "q" and "p" points from EI, points_to_sample and points_being_sampled, respectively. However, in EI optimization, we only need gradients of GP quantities wrt points_to_sample, so users should call member functions with num_derivatives = num_to_sample in that context.
- add_sampled_points(sampled_points)[source]¶
Add sampled points (point, value, noise) to the GP's prior data.
Also forces recomputation of all derived quantities for GP to remain consistent.
Parameters: sampled_points (single moe.optimal_learning.python.SamplePoint or list of SamplePoint objects) – SamplePoint objects to load into the GP (containing point, function value, and noise variance)
- compute_cholesky_variance_of_points(points_to_sample)[source]¶
Compute the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample).
points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.
Parameters: points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP Returns: cholesky factorization of the variance matrix of this GP, lower triangular Return type: array of float64 with shape (num_to_sample, num_to_sample), lower triangle filled in
- compute_grad_cholesky_variance_of_points(points_to_sample, num_derivatives)[source]¶
Compute the gradient of the cholesky factorization of the variance (matrix) of this GP at each point of Xs (points_to_sample) wrt Xs.
points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.
This function accounts for the effect on the gradient resulting from cholesky-factoring the variance matrix. See Smith 1995 for algorithm details.
Note that grad_chol is nominally sized: grad_chol[num_to_sample][num_to_sample][num_to_sample][dim]. Let this be indexed grad_chol[k][j][i][d], which is read as the derivative of var[j][i] with respect to x_{k,d} (x = points_to_sample)
Note
Comments are copied from GaussianProcess in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess.compute_grad_cholesky_variance_of_points().
Parameters: - points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
- num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns: grad_chol: gradient of the cholesky factorization of the variance matrix of this GP. grad_chol[k][j][i][d] is actually the gradients of var_{j,i} with respect to x_{k,d}, the d-th dimension of the k-th entry of points_to_sample
Return type: array of float64 with shape (num_derivatives, num_to_sample, num_to_sample, dim)
- compute_grad_mean_of_points(points_to_sample, num_derivatives)[source]¶
Compute the gradient of the mean of this GP at each of point of Xs (points_to_sample) wrt Xs.
points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.
Note that grad_mu is nominally sized: grad_mu[num_to_sample][num_to_sample][dim]. This is the d-th component of the derivative evaluated at the i-th input wrt the j-th input. However, for 0 <= i,j < num_to_sample, i != j, grad_mu[j][i][d] = 0. (See references or implementation for further details.) Thus, grad_mu is stored in a reduced form which only tracks the nonzero entries.
Note
Comments are copied from GaussianProcess in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess.compute_grad_mean_of_points().
Parameters: - points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
- num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns: grad_mu: gradient of the mean of the GP. grad_mu[i][d] is actually the gradient of \mu_i wrt x_{i,d}, the d-th dim of the i-th entry of points_to_sample.
Return type: array of float64 with shape (num_derivatives, dim)
- compute_grad_variance_of_points(points_to_sample, num_derivatives)[source]¶
Compute the gradient of the variance (matrix) of this GP at each point of Xs (points_to_sample) wrt Xs.
points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.
This function is similar to compute_grad_cholesky_variance_of_points() (above), except this does not include gradient terms from the cholesky factorization. Description will not be duplicated here.
Note
Comments are copied from GaussianProcess in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess.compute_grad_variance_of_points().
Parameters: - points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP
- num_derivatives (int) – return derivatives wrt points_to_sample[0:num_derivatives]; large or negative values are clamped
Returns: grad_var: gradient of the variance matrix of this GP
Return type: array of float64 with shape (num_derivatives, num_to_sample, num_to_sample, dim)
- compute_mean_of_points(points_to_sample)[source]¶
Compute the mean of this GP at each of point of Xs (points_to_sample).
points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.
Note
Comments are copied from GaussianProcess in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess.compute_mean_of_points().
Parameters: points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP Returns: mean: where mean[i] is the mean at points_to_sample[i] Return type: array of float64 with shape (num_to_sample)
- compute_variance_of_points(points_to_sample)[source]¶
Compute the variance (matrix) of this GP at each point of Xs (points_to_sample).
points_to_sample may not contain duplicate points. Violating this results in singular covariance matrices.
The variance matrix is symmetric although we currently return the full representation.
Note
Comments are copied from GaussianProcess in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess.compute_variance_of_points().
Parameters: points_to_sample (array of float64 with shape (num_to_sample, dim)) – num_to_sample points (in dim dimensions) being sampled from the GP Returns: var_star: variance matrix of this GP Return type: array of float64 with shape (num_to_sample, num_to_sample)
- sample_point_from_gp(point_to_sample, noise_variance=0.0)[source]¶
Sample a function value from a Gaussian Process prior, provided a point at which to sample.
Uses the formula function_value = gpp_mean + sqrt(gpp_variance) * w1 + sqrt(noise_variance) * w2, where w1, w2 are draws from N(0,1).
Implementers are responsible for providing a N(0,1) source.
Note
Set noise_variance to 0 if you want “accurate” draws from the GP. BUT if the drawn (point, value) pair is meant to be added back into the GP (e.g., for testing), then this point MUST be drawn with noise_variance equal to the noise associated with “point” as a member of “points_sampled”
Note
Comments are copied from GaussianProcess in gpp_math.hpp and duplicated in moe.optimal_learning.python.cpp_wrappers.gaussian_process.GaussianProcess.sample_point_from_gp().
Parameters: - point_to_sample – point (in dim dimensions) at which to sample from this GP
- noise_variance (float64 >= 0.0) – amount of noise to associate with the sample
Returns: sample_value: function value drawn from this GP
Return type: float64
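A sketch of this formula expressed against the interface methods above (for a single point; the N(0, 1) source here is numpy's global PRNG, whereas implementers manage their own):

    import numpy


    def sample_point_from_gp_sketch(gaussian_process, point_to_sample, noise_variance=0.0):
        point = numpy.atleast_2d(point_to_sample)
        gpp_mean = gaussian_process.compute_mean_of_points(point)[0]
        gpp_variance = gaussian_process.compute_variance_of_points(point)[0, 0]
        w1, w2 = numpy.random.normal(size=2)
        return (gpp_mean + numpy.sqrt(gpp_variance) * w1 +
                numpy.sqrt(noise_variance) * w2)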
moe.optimal_learning.python.interfaces.log_likelihood_interface module¶
Interface for computation of log likelihood (and similar) measures of model fit (of a Gaussian Process) along with its gradient and hessian.
As a preface, you should read gpp_math.hpp’s comments first (if not also gpp_math.cpp) to get an overview of Gaussian Processes (GPs) and how we are using them (Expected Improvement, EI). Python readers can get the basic overview in moe.optimal_learning.python.interfaces.gaussian_process_interface.
Note
these comments are copied from the file comments of gpp_model_selection.hpp.
This file deals with model selection via hyperparameter optimization, as the name implies. In our discussion of GPs, we did not pay much attention to the underlying covariance function. We noted that the covariance is extremely important since it encodes our assumptions about the objective function f(x) that we are trying to learn; i.e., the covariance describes the nearness/similarity of points in the input space. Also, the GP was clearly indicated to be a function of the covariance, but we simply assumed that the selection of covariance was an already-solved problem (or even worse, arbitrary!).
MODEL SELECTION
To better understand model selection, let’s look at a common covariance used in our computation, square exponential: cov(x_1, x_2) = \alpha * \exp(-0.5*r^2), where r^2 = \sum_{i=1}^d (x_1_i - x_2_i)^2 / L_i^2. Here, \alpha is \sigma_f^2, the signal variance, and the L_i are length scales. The vector [\alpha, L_1, ... , L_d] is called the “hyperparameters” or “free parameters” (see gpp_covariance.hpp for more details). There is nothing in the covariance that guides the choice of the hyperparameters; L_1 = 0.001 is just as valid as L_1 = 1000.0.
Clearly, the value of the covariance changes substantially if L_i varies by a factor of two, much less 6 orders of magnitude. That is the difference between saying variations of size approx 1.0 in x_1, the first spatial dimension, are extremely important vs almost irrelevant.
So how do we know what hyperparameters to choose? This question/problem is more generally called “Model Selection.” Although the problem is far from solved, we will present the approaches implemented here; as usual, we use Rasmussen & Williams (Chapter 5 now) as a guide/reference.
However, we will not spend much time discussing selection across different classes of covariance functions; e.g., Square Exponential vs Matern w/various \nu, etc. We have yet to develop any experience/intuition with this problem and are temporarily punting it. For now, we follow the observation in Rasmussen & Williams that Square Exponential is a popular choice and appears to work very well. (This is still a very important problem; e.g., there may be scenarios when we would prefer a non-stationary or periodic covariance, and the methods discussed here do not cover this aspect of selection. Such covariance options are not yet implemented though.)
We do note that the techniques for selecting covariance classes more or less require hyperparameter optimization on each individual covariance. The likely approach would be to produce the best fit (according to chosen metrics) using each type of covariance (using optimization) and then choose the best performer across the group.
MODEL SELECTION OVERVIEW
Generally speaking, there are a great many tunable parameters in any model-based learning algorithm. In our case, the GP takes a covariance function as input; the selection of the covariance class as well as the choice of hyperparameters are all part of the model selection process. Determining these details of the [GP] model is the model selection problem.
In order to evaluate the quality of models (and solve model selection), we need some kind of metric. The literature suggests too many to cite, but R&W groups them into three common approaches (5.1, p108):
- compute the probability of the model given the data (e.g., LML)
- estimate the generalization error (e.g., LOO-CV)
- bound the generalization error
where “generalization error” is defined as “the average error on unseen test examples (from the same distribution as the training cases).” So it’s a measure of how well or poorly the model predicts reality.
For further details and examples of log likelihood measures, see gpp_model_selection.hpp. Overview of some log likelihood measures can be found in moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLogMarginalLikelihood and moe.optimal_learning.python.cpp_wrappers.log_likelihood.GaussianProcessLeaveOneOutLogLikelihood.
OPTIMIZATION
Now that we have discussed measures of model quality, what do we do with them? How do they help us choose hyperparameters?
From here, we can apply anyone’s favorite optimization technique to maximize log likelihoods wrt hyperparameters. The hyperparameters that maximize log likelihood provide the model configuration that is most likely to have produced the data observed so far, (X, f).
In principle, this approach always works. But in practice it is often not that simple. For example, suppose the underlying objective is periodic and we try to optimize hyperparameters for a class of covariance functions that cannot account for the periodicity. We can always* find the set of hyperparameters that maximize our chosen log likelihood measure (LML or LOO-CV), but if the covariance is mis-specified or we otherwise make invalid assumptions about the objective function, then the results are not meaningful at best and misleading at worst. It becomes a case of garbage in, garbage out.
* Even this is tricky. Log likelihood is almost never a convex function. For example, with LML + GPs, you often expect at least two optima, one more complex solution (short length scales, less intrinsic noise) and one less complex solution (longer length scales, higher intrinsic noise). There are even cases where no optima (to machine precision) exist or cases where solutions lie on (lower-dimensional) manifold(s) (e.g., locally the likelihood is (nearly) independent of one or more hyperparameters).
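As a concrete (if generic) illustration of this workflow, the sketch below maximizes a log likelihood measure over hyperparameters with scipy, optimizing in log-space to keep hyperparameters positive. It assumes the hyperparameters property is settable (cf. TODO(GH-71)); any implementation of GaussianProcessLogLikelihoodInterface could stand in for log_likelihood_eval:

    import numpy
    import scipy.optimize


    def maximize_log_likelihood(log_likelihood_eval, initial_hyperparameters):
        def negative_log_likelihood(log_theta):
            log_likelihood_eval.hyperparameters = numpy.exp(log_theta)
            return -log_likelihood_eval.compute_log_likelihood()

        # Derivative-free here; compute_grad_log_likelihood could drive a
        # gradient-based method instead.
        result = scipy.optimize.minimize(negative_log_likelihood,
                                         numpy.log(initial_hyperparameters),
                                         method='Nelder-Mead')
        return numpy.exp(result.x)

As the discussion above warns, multistarting from several initial_hyperparameters values is advisable since the likelihood surface is rarely convex.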
- class moe.optimal_learning.python.interfaces.log_likelihood_interface.GaussianProcessLogLikelihoodInterface[source]¶
Bases: moe.optimal_learning.python.interfaces.gaussian_process_interface.GaussianProcessDataInterface
Interface for computation of log likelihood (and log likelihood-like) measures of model fit along with its gradient and hessian.
See module comments for an overview of log likelihood-like measures of model fit and their role in model selection.
Below, let LL(y | X, \theta) denote the log likelihood of the data (y) given the points_sampled (X) and the hyperparameters (\theta). \theta is the vector that is varied. (X, y) (and associated noise) should be stored as data members by the implementation’s constructor.
See gpp_model_selection.hpp/cpp for further overview and in-depth discussion, respectively.
- compute_grad_log_likelihood()[source]¶
Compute the gradient (wrt hyperparameters) of this log likelihood measure of model fit.
Returns: grad_log_likelihood: i-th entry is \pderiv{LL(y | X, \theta)}{\theta_i} Return type: array of float64 with shape (num_hyperparameters)
- compute_hessian_log_likelihood()[source]¶
Compute the hessian (wrt hyperparameters) of this log likelihood measure of model fit.
See moe.optimal_learning.python.interfaces.covariance_interface.CovarianceInterface.hyperparameter_hessian_covariance() for data ordering.
Returns: hessian_log_likelihood: (i,j)-th entry is \mixpderiv{LL(y | X, \theta)}{\theta_i}{\theta_j} Return type: array of float64 with shape (num_hyperparameters, num_hyperparameters)
- compute_log_likelihood()[source]¶
Compute a log likelihood measure of model fit.
Returns: value of log_likelihood evaluated at hyperparameters (LL(y | X, \theta)) Return type: float64
- get_hyperparameters()[source]¶
Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.
- hyperparameters¶
Get the hyperparameters (array of float64 with shape (num_hyperparameters)) of this covariance.
moe.optimal_learning.python.interfaces.optimization_interface module¶
Interfaces for optimization (maximization): OptimizableInterface for things that can be optimized and OptimizerInterface to perform the optimization.
First, implementations of these interfaces should be MAXIMIZERS. We also use the term “optima,” and unless we specifically state otherwise, “optima” and “optimization” refer to “maxima” and “maximization,” respectively. (Note that minimizing g(x) is equivalent to maximizing f(x) = -1 * g(x).)
See the file comments for gpp_optimization.hpp for further discussion of optimization as well as the particular techniques available through the C++ interface.
- class moe.optimal_learning.python.interfaces.optimization_interface.OptimizableInterface[source]¶
Bases: object
Interface that an object must fulfill to be optimized by an implementation of OptimizerInterface.
Below, f(x) is the scalar objective function represented by this object. x is a vector-valued input with problem_size dimensions. With f(x) (and/or its derivatives), an OptimizableInterface implementation can be hooked up to an OptimizerInterface implementation to find the maximum value of f(x) and the input x at which this maximum occurs.
This interface is straightforward–we need the ability to compute the problem size (how many independent parameters to optimize) as well as the ability to compute f(x) and/or its various derivatives. An implementation of f(x) is required; this allows for derivative-free optimization methods. Providing derivatives opens the door to more advanced/efficient techniques (e.g., gradient descent, BFGS, Newton).
This interface is meant to be generic. For example, when optimizing the log marginal likelihood of a GP model (wrt hyperparameters of covariance; e.g., python_version.log_likelihood.GaussianProcessLogMarginalLikelihood) f is the log marginal, x is the vector of hyperparameters, and problem_size is num_hyperparameters. Note that log marginal and covariance both have an associated spatial dimension, and this is NOT problem_size. For Expected Improvement (e.g., python_version.expected_improvement.ExpectedImprovement), f would be the EI, x is the new experiment point (or points) being optimized, and problem_size is dim (or num_points*dim).
TODO(GH-71): getter/setter for current_point.
- compute_grad_objective_function(**kwargs)[source]¶
Compute the gradient of f(current_point) wrt current_point.
Returns: gradient of the objective, i-th entry is \pderiv{f(x)}{x_i} Return type: array of float64 with shape (problem_size)
- compute_hessian_objective_function(**kwargs)[source]¶
Compute the hessian matrix of f(current_point) wrt current_point.
This matrix is symmetric as long as the mixed second derivatives of f(x) are continuous: Clairaut’s Theorem. http://en.wikipedia.org/wiki/Symmetry_of_second_derivatives
Returns: hessian of the objective, (i,j)th entry is \mixpderiv{f(x)}{x_i}{x_j} Return type: array of float64 with shape (problem_size, problem_size)
- compute_objective_function(**kwargs)[source]¶
Compute f(current_point).
Returns: value of objective function evaluated at current_point Return type: float64
- current_point¶
Get the current_point (array of float64 with shape (problem_size)) at which this object is evaluating the objective function, f(x).
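A toy OptimizableInterface-style objective for illustration (a real implementation would subclass OptimizableInterface; the class and attribute names here are hypothetical):

    import numpy


    class NegativeQuadratic(object):
        """f(x) = -|x - target|^2, maximized at x = target."""

        def __init__(self, target):
            self._target = numpy.asarray(target, dtype=float)
            self.current_point = numpy.zeros_like(self._target)

        @property
        def problem_size(self):
            return self._target.size

        def compute_objective_function(self):
            return -numpy.sum((self.current_point - self._target) ** 2)

        def compute_grad_objective_function(self):
            return -2.0 * (self.current_point - self._target)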
- class moe.optimal_learning.python.interfaces.optimization_interface.OptimizerInterface[source]¶
Bases: object
Interface to maximize any object implementing OptimizableInterface (defined above).
Implementations are responsible for tracking an OptimizableInterface subclass (the objective being optimized), a DomainInterface subclass (the domain that the objective lives in), and any parameters needed for controlling optimization behavior*.
* Examples include iteration counts, tolerances, learning rate, etc. It is suggested that implementers define a FooParameters container class for their FooOptimizer implementation of this interface.
- optimize(**kwargs)[source]¶
Maximize a function f(x), represented by an implementation of OptimizableInterface.
The initial guess is set through calling the set_current_point method of this object’s OptimizableInterface data member.
In general, kwargs not specifically consumed by the implementation of optimize() should be passed down to member functions of the optimizable input.
This method is not required to have a return value; implementers may use one for convenience. The optimal point (as determined by optimization) should be available through the OptimizableInterface data member’s get_current_point method.
TODO(GH-59): Pass the best point, fcn value, etc. in thru an IOContainer-like structure.
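A minimal OptimizerInterface-style maximizer, sketched as fixed-step gradient ascent over the toy objective above (names hypothetical; real implementations would also track a DomainInterface and a parameters container):

    class FixedStepGradientAscentOptimizer(object):
        """Leaves the best point found in objective.current_point, matching the
        convention described above."""

        def __init__(self, objective, step_size=0.1, max_num_steps=100):
            self.objective = objective
            self.step_size = step_size
            self.max_num_steps = max_num_steps

        def optimize(self):
            for _ in range(self.max_num_steps):
                gradient = self.objective.compute_grad_objective_function()
                self.objective.current_point = (self.objective.current_point +
                                                self.step_size * gradient)


    # Usage: objective.current_point ends up near the maximizer [1.0, -2.0].
    objective = NegativeQuadratic(target=[1.0, -2.0])
    FixedStepGradientAscentOptimizer(objective).optimize()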
Module contents¶
Interfaces for structures needed by the optimal_learning package to build Gaussian Process models and optimize the Expected Improvement.
The package comments here will introduce what optimal_learning is attempting to accomplish, provide an outline of Gaussian Processes, and introduce the notion of Expected Improvement and its optimization.
The modules in this package provide the interfaces for interacting with all the features of optimal_learning.
Note
These comments were copied from the file comments in gpp_math.hpp.
At a high level, this file optimizes an objective function f(x). This operation requires data/uncertainties about prior and concurrent experiments as well as a covariance function describing how these data [are expected to] relate to each other. The points x represent experiments. If f(x) is, say, survival rate for a drug test, the dimensions of x might include dosage amount, dosage frequency, and overall drug-use time-span.
The objective function is not required in closed form; instead, only the ability to sample it at points of interest is needed. Thus, the optimization process cannot work with f(x) directly; instead a surrogate is built via interpolation with Gaussian Processes (GPs).
Following Rasmussen & Williams (2.2), a Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Defn 2.1). Hence a GP is fully specified by its mean function, m(x), and covariance function, k(x,x’). Then we assume that a real process f(x) (e.g., drug survival rate) is distributed like:

f(x) ~ GP(m(x), k(x,x'))

with m(x) taken to be zero (the zero-mean assumption noted in the gaussian_process_interface docs). Then sampling from f(x) is simply drawing from a Gaussian with the appropriate mean and variance.
However, since we do not know f(x), we cannot precisely build its corresponding GP. Instead, using samples from f(x) (e.g., by measuring experimental outcomes), we can iteratively improve our estimate of f(x). See GaussianProcessInterface class docs and implementation docs for details on how this is done.
The optimization process models the objective using a Gaussian process (GP) prior (also called a GP predictor) based on the specified covariance and the input data (e.g., through member functions compute_mean_of_points, compute_variance_of_points). Using the GP, we can compute the expected improvement (EI) from sampling any particular point. EI is defined relative to the best currently known value, and it represents what the algorithm believes is the most likely outcome from sampling a particular point in parameter space (aka conducting a particular experiment).
See ExpectedImprovementInterface ABC and implementation docs for further details on computing EI. Both support compute_expected_improvement() and compute_grad_expected_improvement().
The dimension of the GP is equal to the number of simultaneous experiments being run; i.e., the GP may be multivariate. The behavior of the GP is controlled by its underlying covariance function and the data/uncertainty of prior points (experiments).
With the ability to compute EI, the final step is to optimize to find the best EI. This is done using multistart gradient descent (MGD), in multistart_expected_improvement_optimization(). This method wraps a MGD call and falls back on random search if that fails. See gpp_optimization.hpp for multistart/optimization templates. This method can evaluate and optimize EI at several points simultaneously; e.g., if we wanted to run 4 simultaneous experiments, we can use EI to select all 4 points at once.
The literature (e.g., Ginsbourger 2008) refers to these problems collectively as q-EI, where q is a positive integer. So 1-EI is the originally discussed usage, and the previous scenario with multiple simultaneous points/experiments would be called 4-EI.
Additionally, there are use cases where we have existing experiments that are not yet complete but we have an opportunity to start some new trials. For example, maybe we are a drug company currently testing 2 combinations of dosage levels. We got some new funding, and can now afford to test 3 more sets of dosage parameters. Ideally, the decision on the new experiments should depend on the existence of the 2 ongoing tests. We may not have any data from the ongoing experiments yet; e.g., they are [double]-blind trials. If nothing else, we would not want to duplicate any existing experiments! So we want to solve 3-EI using the knowledge of the 2 ongoing experiments.
We call this q,p-EI, so the previous example would be 3,2-EI. The q-EI notation is equivalent to q,0-EI; if we do not explicitly write the value of p, it is 0. So q is the number of new (simultaneous) experiments to select. In code, this would be the size of the output from EI optimization (i.e., best_points_to_sample, of which there are q = num_to_sample points). p is the number of ongoing/incomplete experiments to take into account (i.e., points_being_sampled of which there are p = num_being_sampled points).
Back to optimization: the idea behind gradient descent is simple. The gradient gives us the direction of steepest ascent (negative gradient is steepest descent). So each iteration, we compute the gradient and take a step in that direction. The size of the step is not specified by GD and is left to the specific implementation. Basically if we take steps that are too large, we run the risk of over-shooting the solution and even diverging. If we take steps that are too small, it may take an intractably long time to reach the solution. Thus the magic is in choosing the step size; we do not claim that our implementation is perfect, but it seems to work reasonably. See gpp_optimization.hpp for more details about GD as well as the template definition.
For particularly difficult problems or problems where gradient descent’s parameters are not well-chosen, GD can fail to converge. If this happens, we can fall back on heuristics; e.g., ‘dumb’ search (i.e., evaluate EI at a large number of random points and take the best one). This functionality is accessed through: multistart_expected_improvement_optimization().
- And domain-specific notation, following Rasmussen, Williams:
- X = points_sampled; this is the training data (size dim X num_sampled), also called the design matrix
- Xs = points_to_sample; this is the test data (size dim X num_to_sample)
- y, f, f(x) = points_sampled_value, the experimental results from sampling training points
- K, K_{ij}, K(X,X) = covariance(X_i, X_j), covariance matrix between training inputs (num_sampled x num_sampled)
- Ks, Ks_{ij}, K(X,Xs) = covariance(X_i, Xs_j), covariance matrix between training and test inputs (num_sampled x num_to_sample)
- Kss, Kss_{ij}, K(Xs,Xs) = covariance(Xs_i, Xs_j), covariance matrix between test inputs (num_to_sample x num_to_sample)
- \theta: (vector) of hyperparameters for a covariance function
Note
Due to confusion with multiplication (K_* looks awkward in code comments), Rasmussen & Williams’ K_* notation has been replaced with Ks and K_{**} is Kss.
Connecting to the q,p-EI notation, both the points represented by “q” and “p” are represented by Xs. Within the GP, there is no distinction between points being sampled by ongoing experiments and new points to sample.