gpp_optimizer_parameters

Contents:

gpp_optimizer_parameters.hpp

This file specifies OptimizerParameters structs (e.g., GradientDescent, Newton) for holding values that control the behavior of the optimizers in gpp_optimization.hpp. For example, max step sizes, number of iterations, step size control, etc. are all specified through these structs.

These structs also specify multistart behavior pertaining to the multistart optimization code in gpp_math and gpp_model_selection.

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.

Enums

OptimizerTypes enum

Enum for the various optimizer types. Convenient for specifying which optimizer to use in testing and also used by the Python interface to specify the optimizer (e.g., for EI and hyperparameter optimization).

Values:

  • kNull = 0 -

    NullOptimizer<>, used for evaluating objective at points.

  • kGradientDescent = 1 -

    GradientDescentOptimizer<>

  • kNewton = 2 -

    NewtonOptimizer<>
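
A minimal, self-contained sketch of dispatching on these values. The enum is redeclared locally so the sketch compiles standalone, and OptimizerName is a hypothetical helper, not part of this header:

    #include <stdexcept>
    #include <string>

    // Local redeclaration of the enum documented above, for a standalone sketch.
    enum class OptimizerTypes : int {
      kNull = 0,             // NullOptimizer<>
      kGradientDescent = 1,  // GradientDescentOptimizer<>
      kNewton = 2,           // NewtonOptimizer<>
    };

    // Map an enum value (e.g., received from the Python interface) to the
    // optimizer template it selects.
    std::string OptimizerName(OptimizerTypes type) {
      switch (type) {
        case OptimizerTypes::kNull: return "NullOptimizer<>";
        case OptimizerTypes::kGradientDescent: return "GradientDescentOptimizer<>";
        case OptimizerTypes::kNewton: return "NewtonOptimizer<>";
      }
      throw std::invalid_argument("unknown OptimizerTypes value");
    }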

class GradientDescentParameters

Container to hold parameters that specify the behavior of Gradient Descent.

Note

These comments are copied in build_gradient_descent_parameters() in cpp_wrappers/optimizer_parameters.py. That function wraps this struct’s ctor.

Iterations

The total number of gradient descent steps is at most num_multistarts * max_num_steps * max_num_restarts. Generally, allowing more iterations leads to a better solution but costs more time.
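
For example (illustrative values): num_multistarts = 200, max_num_steps = 500, and max_num_restarts = 10 bound the total at 200 * 500 * 10 = 1,000,000 gradient descent steps.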

Averaging (TODO(GH-390): NOT IMPLEMENTED YET)

When optimizing stochastic objective functions, it can often be beneficial to average some number of gradient descent steps to obtain the final result (vs just returning the last step). Polyak-Ruppert averaging is a postprocessing step in which we replace the final iterate x_n with:

\bar{x} = \frac{1}{n - n_0} \sum_{t = n_0 + 1}^{n} x_t

Setting n_0 = 0 averages all steps; n_0 = n - 1 is equivalent to returning x_n directly. Here, num_steps_averaged is n - n_0. The cases are listed below (a minimal averaging sketch follows the list):

  • num_steps_averaged < 0: averages all steps
  • num_steps_averaged == 0: do not average
  • num_steps_averaged > 0 and <= max_num_steps: average the specified number of steps
  • num_steps_averaged > max_num_steps: average all steps
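
A minimal sketch of this averaging rule, assuming the optimizer stores its trajectory of iterates; the container and function name here are illustrative, not this header's API:

    #include <vector>

    // Average the last num_steps_averaged iterates x_t (each of dimension dim),
    // following the cases listed above (n here is the number of recorded steps).
    // Assumes trajectory is non-empty.
    std::vector<double> PolyakRuppertAverage(
        const std::vector<std::vector<double>>& trajectory, int num_steps_averaged) {
      const int n = static_cast<int>(trajectory.size());
      const int dim = static_cast<int>(trajectory.back().size());
      int num_averaged = num_steps_averaged;
      if (num_averaged <= 0 || num_averaged > n) {
        // 0: return the last step only; < 0 or > n: average all steps.
        num_averaged = (num_averaged == 0) ? 1 : n;
      }
      std::vector<double> average(dim, 0.0);
      for (int t = n - num_averaged; t < n; ++t) {
        for (int d = 0; d < dim; ++d) {
          average[d] += trajectory[t][d];
        }
      }
      for (int d = 0; d < dim; ++d) {
        average[d] /= static_cast<double>(num_averaged);
      }
      return average;
    }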

Learning Rate

GD may be implemented using a learning rate: pre_mult * (i+1)^{-\gamma}, where i is the current iteration. Larger gamma causes the GD step size to (artificially) scale down faster. Smaller pre_mult (artificially) shrinks the GD step size. Generally, taking a very large number of small steps leads to the most robustness, but it is very slow.
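
As a minimal sketch (the function name is hypothetical), this schedule is:

    #include <cmath>

    // Decaying GD step size: pre_mult * (i + 1)^{-gamma},
    // where i is the current iteration (0-based).
    double GradientDescentStepSize(double pre_mult, double gamma, int iteration) {
      return pre_mult * std::pow(static_cast<double>(iteration + 1), -gamma);
    }

For example, with pre_mult = 1.0 and gamma = 0.7, the first few step sizes are roughly 1.0, 0.62, 0.46, 0.38, ...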

Tolerances

Larger relative changes are potentially less robust but lead to faster convergence. Large tolerances run faster but may lead to high errors or false convergence (e.g., if the tolerance is 1.0e-3 and the learning rate control forces steps to fall below 1.0e-3 quickly, then GD will quit “successfully” without genuinely converging.)

Public Functions

GradientDescentParameters()

GradientDescentParameters(int num_multistarts_in, int max_num_steps_in, int max_num_restarts_in, int num_steps_averaged_in, double gamma_in, double pre_mult_in, double max_relative_change_in, double tolerance_in)

Construct a GradientDescentParameters object. Default, copy, and assignment constructors are disallowed.

INPUTS: See member declarations below for a description of each parameter.
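
A minimal usage sketch; the numeric values are illustrative picks from the suggested ranges in the member docs below, not recommendations from this header:

    // Illustrative construction; tune these values for your problem.
    optimal_learning::GradientDescentParameters gd_parameters(
        300,      // num_multistarts: number of initial guesses ("a few hundred")
        500,      // max_num_steps: max GD iterations per restart
        10,       // max_num_restarts: max number of restarts
        0,        // num_steps_averaged: 0 disables averaging (non-stochastic objective)
        0.7,      // gamma: step size decay exponent
        1.0,      // pre_mult: step size scale factor
        0.5,      // max_relative_change: max relative change per iteration
        1.0e-7);  // tolerance: stopping threshold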

GradientDescentParameters(GradientDescentParameters && OL_UNUSED)

Public Members

int num_multistarts

number of initial guesses to try in multistarted gradient descent (suggest: a few hundred)

int max_num_steps

maximum number of gradient descent iterations per restart (suggest: 200-1000)

int max_num_restarts

maximum number of gradient descent restarts, i.e., the number of times we are allowed to call gradient descent. Should be >= 2 as a minimum (suggest: 4-20)

int num_steps_averaged

number of steps to use in Polyak-Ruppert averaging (see above) (suggest: 10-50% of max_num_steps for stochastic problems, 0 otherwise)

double gamma

exponent controlling rate of step size decrease (see struct docs or GradientDescentOptimizer) (suggest: 0.5-0.9)

double pre_mult

scaling factor for step size (see struct docs or GradientDescentOptimizer) (suggest: 0.1-1.0)

double max_relative_change

max change allowed per GD iteration (as a relative fraction of current distance to wall) (suggest: 0.5-1.0 for less sensitive problems like EI; 0.02 for more sensitive problems like hyperparameter opt)

double tolerance

stop when the magnitude of the gradient falls below this value OR when we will not move farther than tolerance (e.g., at a boundary) (suggest: 1.0e-7)

class NewtonParameters

Container to hold parameters that specify the behavior of Newton.

Note

These comments are copied in build_newton_parameters() in cpp_wrappers/optimizer_parameters.py. That function wraps this struct’s ctor.

Diagonal dominance control: ``gamma`` and ``time_factor``

On the i-th Newton iteration, we add 1/(time_factor * gamma^{i+1}) * I to the Hessian to improve robustness.

Choosing a small gamma (e.g., 1.0 < gamma <= 1.01) and time_factor (e.g., 0 < time_factor <= 1.0e-3) leads to more consistent/stable convergence at the cost of slower performance (and in fact for gamma or time_factor too small, gradient descent is preferred). Conversely, choosing more aggressive values may lead to very fast convergence at the cost of more cases failing to converge.

gamma = 1.01, time_factor = 1.0e-3 should lead to good robustness at reasonable speed; this should be a fairly safe default. gamma = 1.05, time_factor = 1.0e-1 will be several times faster but not as robust. For “easy” problems, these can be much more aggressive, e.g., gamma = 2.0, time_factor = 1.0e1 or more.
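
A minimal sketch of applying this term, assuming a dense row-major Hessian; the storage and function name are illustrative, not this header's API:

    #include <cmath>
    #include <vector>

    // Add 1/(time_factor * gamma^{i+1}) * I to a dim x dim Hessian (row-major),
    // where iteration is the 0-based Newton iteration index i.
    void AddDiagonalDominance(double gamma, double time_factor, int iteration,
                              int dim, std::vector<double>* hessian) {
      const double shift = 1.0 / (time_factor * std::pow(gamma, iteration + 1));
      for (int d = 0; d < dim; ++d) {
        (*hessian)[d * dim + d] += shift;
      }
    }

Since gamma > 1, the added shift decays as iterations proceed: early steps are heavily damped for robustness, while later steps approach a pure Newton update.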

Public Functions

NewtonParameters()

NewtonParameters(int num_multistarts_in, int max_num_steps_in, double gamma_in, double time_factor_in, double max_relative_change_in, double tolerance_in)

Construct a NewtonParameters object. Default, copy, and assignment constructors are disallowed.

INPUTS: See member declarations below for a description of each parameter.
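
A minimal usage sketch using the "safe default" gamma/time_factor quoted above; the remaining values are illustrative picks from the member docs below:

    // Illustrative construction; tune these values for your problem.
    optimal_learning::NewtonParameters newton_parameters(
        300,       // num_multistarts: number of initial guesses ("a few hundred")
        100,       // max_num_steps: max Newton iterations per initial guess
        1.01,      // gamma: rate of diagonal dominance decay
        1.0e-3,    // time_factor: initial diagonal dominance scale
        1.0,       // max_relative_change: max relative change per update
        1.0e-10);  // tolerance: gradient magnitude stopping threshold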

NewtonParameters(NewtonParameters && OL_UNUSED)

Public Members

int num_multistarts

number of initial guesses for multistarting (suggest: a few hundred)

int max_num_steps

maximum number of Newton iterations (per initial guess) (suggest: 100)

const int max_num_restarts

maximum number of Newton restarts (fixed; not used by Newton)

double gamma

exponent controlling rate of time_factor growth (see class docs and NewtonOptimizer) (suggest: 1.01-1.1)

double time_factor

initial amount of additive diagonal dominance (see class docs and NewtonOptimizer) (suggest: 1.0e-3 to 1.0e-1)

double max_relative_change

max change allowed per update (as a relative fraction of current distance to wall) (Newton may ignore this) (suggest: 1.0)

double tolerance

when the magnitude of the gradient falls below this value, stop (suggest: 1.0e-10)

class NullParameters

Empty container for optimizers that do not require any parameters (e.g., the null optimizer).