Efficient Global Optimization

Efficient Global Optimization (EGO) was developed to facilitate the unconstrained minimization of expensive implicit response functions. The method builds an initial Gaussian process model as a global surrogate for the response function, then intelligently selects additional samples to be added for inclusion in a new Gaussian process model in subsequent iterations. The new samples are selected based on how much they are expected to improve the current best solution to the optimization problem. When this expected improvement is acceptably small, the globally optimal solution has been found. The application of this methodology to equality-constrained reliability analysis is the primary contribution of EGRA.

Efficient global optimization was originally proposed by Jones et al. [JSW98] and has been adapted into similar methods such as sequential kriging optimization (SKO) [HANZ06]. The main difference between SKO and EGO lies within the specific formulation of what is known as the expected improvement function (EIF), which is the feature that sets all EGO/SKO-type methods apart from other global optimization methods. The EIF is used to select the location at which a new training point should be added to the Gaussian process model by maximizing the amount of improvement in the objective function that can be expected by adding that point. A point could be expected to produce an improvement in the objective function if its predicted value is better than the current best solution, or if the uncertainty in its prediction is such that the probability of it producing a better solution is high. Because the uncertainty is higher in regions of the design space with fewer observations, this provides a balance between exploiting areas of the design space that predict good solutions, and exploring areas where more information is needed.

The general procedure of these EGO-type methods is:

Build an initial Gaussian process model of the objective function.
Find the point that maximizes the EIF. If the EIF value at this point is sufficiently small, stop.
Evaluate the objective function at the point where the EIF is maximized. Update the Gaussian process model using this new point. Go to Step 2.

The following sections discuss the construction of the Gaussian process model used, the form of the EIF, and then a description of how that EIF is modified for application to reliability analysis.

Gaussian Process Model

Gaussian process (GP) models are set apart from other surrogate models because they provide not just a predicted value at an unsampled point, but also an estimate of the prediction variance. This variance gives an indication of the uncertainty in the GP model, which results from the construction of the covariance function. This function is based on the idea that when input points are near one another, the correlation between their corresponding outputs will be high. As a result, the uncertainty associated with the model’s predictions will be small for input points which are near the points used to train the model, and will increase as one moves further from the training points.

It is assumed that the true response function being modeled \(G({\bf u})\) can be described by: [Cre91]

\[G({\bf u})={\bf h}({\bf u})^T{\boldsymbol \beta} + Z({\bf u})\]

where \({\bf h}()\) is the trend of the model, \({\boldsymbol \beta}\) is the vector of trend coefficients, and \(Z()\) is a stationary Gaussian process with zero mean (and covariance defined below) that describes the departure of the model from its underlying trend. The trend of the model can be assumed to be any function, but taking it to be a constant value has been reported to be generally sufficient. [SSW89] For the work presented here, the trend is assumed constant and \({\boldsymbol \beta}\) is taken as simply the mean of the responses at the training points. The covariance between outputs of the Gaussian process \(Z()\) at points \({\bf a}\) and \({\bf b}\) is defined as:

(258)\[Cov \left[ Z({\bf a}),Z({\bf b}) \right] = \sigma_Z^2 R({\bf a},{\bf b})\]

where \(\sigma_Z^2\) is the process variance and \(R()\) is the correlation function. There are several options for the correlation function, but the squared-exponential function is common [SSW89], and is used here for \(R()\):

\[R({\bf a},{\bf b}) = \exp \left[ -\sum_{i=1}^d \theta_i(a_i - b_i)^2 \right]\]

where \(d\) represents the dimensionality of the problem (the number of random variables), and \(\theta_i\) is a scale parameter that indicates the correlation between the points within dimension \(i\). A large \(\theta_i\) is representative of a short correlation length.

The expected value \(\mu_G()\) and variance \(\sigma_G^2()\) of the GP model prediction at point \({\bf u}\) are:

(259)\[\begin{aligned} \mu_G({\bf u}) &= {\bf h}({\bf u})^T{\boldsymbol \beta} + {\bf r}({\bf u})^T{\bf R}^{-1}({\bf g} - {\bf F}{\boldsymbol \beta}) \end{aligned}\]

(260)\[\begin{split}\begin{aligned} \sigma_G^2({\bf u}) &= \sigma_Z^2 - \begin{bmatrix} {\bf h}({\bf u})^T & {\bf r}({\bf u})^T \end{bmatrix} \begin{bmatrix} {\bf 0} & {\bf F}^T \\ {\bf F} & {\bf R} \end{bmatrix}^{-1} \begin{bmatrix} {\bf h}({\bf u}) \\ {\bf r}({\bf u}) \end{bmatrix} \end{aligned}\end{split}\]

where \({\bf r}({\bf u})\) is a vector containing the covariance between \({\bf u}\) and each of the \(n\) training points (defined by Eq. (258)), \({\bf R}\) is an \(n \times n\) matrix containing the correlation between each pair of training points, \({\bf g}\) is the vector of response outputs at each of the training points, and \({\bf F}\) is an \(n \times q\) matrix with rows \({\bf h}({\bf u}_i)^T\) (the trend function for training point \(i\) containing \(q\) terms; for a constant trend \(q\!=\!1\)). This form of the variance accounts for the uncertainty in the trend coefficients \(\boldsymbol \beta\), but assumes that the parameters governing the covariance function (\(\sigma_Z^2\) and \(\boldsymbol \theta\)) have known values.

The parameters \(\sigma_Z^2\) and \({\boldsymbol \theta}\) are determined through maximum likelihood estimation. This involves taking the log of the probability of observing the response values \({\bf g}\) given the covariance matrix \({\bf R}\), which can be written as: [SSW89]

(261)\[\log \left[ p({\bf g} | {\bf R}) \right] = -\frac{1}{n} \log \lvert{\bf R}\rvert - \log(\hat{\sigma}_Z^2)\]

where \(\lvert {\bf R} \rvert\) indicates the determinant of \({\bf R}\), and \(\hat{\sigma}_Z^2\) is the optimal value of the variance given an estimate of \(\boldsymbol \theta\) and is defined by:

\[\hat{\sigma}_Z^2 = \frac{1}{n}({\bf g}-{\bf F}{\boldsymbol \beta})^T {\bf R}^{-1}({\bf g}-{\bf F}{\boldsymbol \beta})\]

Maximizing Eq. (261) gives the maximum likelihood estimate of \(\boldsymbol \theta\), which in turn defines \(\sigma_Z^2\).

Acquisition Functions

The acquisition function determines the location of the next sampling point or refinement points, in the sense that maximizing the acquisition function yields the next sampling point, as

\[{\bf u}^* = \underset{\bf u}{\text{argmax}}~a({\bf u}).\]

Expected Improvement

The expected improvement function is used to select the location at which a new training point should be added. The EIF is defined as the expectation that any point in the search space will provide a better solution than the current best solution based on the expected values and variances predicted by the GP model. An important feature of the EIF is that it provides a balance between exploiting areas of the design space where good solutions have been found, and exploring areas of the design space where the uncertainty is high. First, recognize that at any point in the design space, the GP prediction \(\hat{G}()\) is a Gaussian distribution:

\[\hat{G}({\bf u}) \sim \mathcal{N}\left( \mu_G({\bf u}), \sigma_G({\bf u}) \right)\]

where the mean \(\mu_G()\) and the variance \(\sigma_G^2()\) were defined in Eqs. (259) and (260), respectively. The EIF is defined as: [JSW98]

\[EI\bigl( \hat{G}({\bf u}) \bigr) \equiv E\left[ \max \left( G({\bf u}^*) - \hat{G}({\bf u}),0 \right) \right]\]

where \(G({\bf u}^*)\) is the current best solution chosen from among the true function values at the training points (henceforth referred to as simply \(G^*\)). This expectation can then be computed by integrating over the distribution \(\hat{G}({\bf u})\) with \(G^*\) held constant:

(262)\[EI\bigl( \hat{G}({\bf u}) \bigr) = \int_{-\infty}^{G^*} \left( G^* - G \right) \, \hat{G}({\bf u}) \; dG\]

where \(G\) is a realization of \(\hat{G}\). This integral can be expressed analytically as: [JSW98]

(263)\[EI\bigl( \hat{G}({\bf u}) \bigr) = \left( G^* - \mu_G \right) \, \Phi\left( \frac{G^* - \mu_G}{\sigma_G} \right) + \sigma_G \, \phi\left( \frac{G^* - \mu_G}{\sigma_G} \right)\]

where it is understood that \(\mu_G\) and \(\sigma_G\) are functions of \({\bf u}\). Rewritting in a more compact manner and dropping the subscript \(_G\),

(264)\[a_\text{EI}({\bf u}, \{{\bf u}_i,y_i \}_{i=1}^N,\theta)) = \sigma({\bf u}) \cdot( \gamma({\bf u}) \Phi(\gamma({\bf u}) ) + \phi(\gamma({\bf u})) ),\]

where \(\gamma({\bf u}) = \frac{G^* - \mu({\bf u})}{\sigma({\bf u})}\). This equation defines the expected improvement acquisition function for an unknown \({\bf u}\).

The point at which the EIF is maximized is selected as an additional training point. With the new training point added, a new GP model is built and then used to construct another EIF, which is then used to choose another new training point, and so on, until the value of the EIF at its maximized point is below some specified tolerance. In Ref. [HANZ06] this maximization is performed using a Nelder-Mead simplex approach, which is a local optimization method. Because the EIF is often highly multimodal [JSW98] it is expected that Nelder-Mead may fail to converge to the true global optimum. In Ref. [JSW98], a branch-and-bound technique for maximizing the EIF is used, but was found to often be too expensive to run to convergence. In Dakota, an implementation of the DIRECT global optimization algorithm is used [Gab01].

It is important to understand how the use of this EIF leads to optimal solutions. Eq. (263) indicates how much the objective function value at \({\bf x}\) is expected to be less than the predicted value at the current best solution. Because the GP model provides a Gaussian distribution at each predicted point, expectations can be calculated. Points with good expected values and even a small variance will have a significant expectation of producing a better solution (exploitation), but so will points that have relatively poor expected values and greater variance (exploration).

The application of EGO to reliability analysis, however, is made more complicated due to the inclusion of equality constraints (see Eqs. (58)- (59)). For inverse reliability analysis, this extra complication is small. The response being modeled by the GP is the objective function of the optimization problem (see Eq. (59)) and the deterministic constraint might be handled through the use of a merit function, thereby allowing EGO to solve this equality-constrained optimization problem. Here the problem lies in the interpretation of the constraint for multimodal problems as mentioned previously. In the forward reliability case, the response function appears in the constraint rather than the objective. Here, the maximization of the EIF is inappropriate because feasibility is the main concern. This application is therefore a significant departure from the original objective of EGO and requires a new formulation. For this problem, the expected feasibility function is introduced.

Probability Improvement Acquisition Function

The probability of improvement (PI) acquisition function is proposed by [Kus64], using the same argument that the GP prediction is a Gaussian distribution. Similar to Equation (264), the PI acquisition function is given by

\[a_{\text{PI}}({\bf u}) = \Phi(\gamma({\bf u})).\]

Generally speaking, the EI acquisition function performs better than the PI acquisition function.

Lower-Confidence Bound Acquisition Function

Another form of acquisition is lower-confidence bound (LCB), proposed recently by Srinivas et al. [SKKS09, SKKS12], which has shown to perform very well. The LCB acquisition function takes the form of

\[a_{\text{LCB}}({\bf u}) = - \mu({\bf u}) + \kappa \sigma({\bf u}),\]

where \(\kappa\) is a hyper-parameter describing the acquisition exploitation-exploration balance. In many cases in design optimization, \(\kappa = 2\) is preferred, but relaxing this \(\kappa\) as a function of iterations is also possible, cf. Daniel et al. [DVM+14], as

\[\kappa = \sqrt{\nu \gamma_n},\quad \nu = 1, \quad \gamma_n = 2\log{\left(\frac{N^{d/2 + 2}\pi^2}{3\delta} \right)},\]

and \(d\) is the dimensionality of the problem, and \(\delta \in (0,1)\) [SKKS12].

Batch-sequential parallel

The batch-sequential parallelization is mainly motivated by exploiting the computational resource, where multiple sampling points \({\bf u}\) can be queried concurrently on a high-performance computing platform. The benefit of batch implementation is that the physical time to converge to the optimal solution is significantly reduced with a factor of \(\sqrt{K}\), where \(K\) is the batch size. While there are many flavors of batch-sequential parallelization, as well as asynchronous parallelization in EGO and Bayesian optimization, we mainly review the theory of GP-BUCB by Desautels et al. [DKB14], GP-UCB-PE by Contal et al [CBRV13], and pBO-2GP-3B by Tran et al [TSF+19]. The parallelization feature of EGO is sometimes referred to as lookahead or non-myopic Bayesian optimization in the literature, especially in the machine learning community.

The approach by Desautels et al. [DKB14] mainly advocates for the “hallucination” scheme or heuristic liar, in which the unknown observation at the currently querying sampling point \({\bf u}^*\) is temporarily assumed as the posterior mean \(\mu({\bf u}^*)\). Then, the underlying GP model updates based on this assumption and locates other points in the same batch, until the batch is filled. After the whole batch is constructed, it is then queried, and all the responses are received at once when the batch is completed. Contal et al. [CBRV13] extended from the work of Desautels et al. [DKB14] and proved that including pure exploration (i.e. sampling at \({\bf u}^*\) where \(\sigma({\bf u})\) is maximum) increases the efficiency. Tran et al. [TSF+19] adopted two aforementioned approaches and extended for known and unknown constraints.

The asynchronous batch parallel EGO is implemented based on the idea of further leveraging computational efficiency when the computational query cost varies widely. In this scenario, the batch-sequential parallel EGO finishes one iteration when the last worker of the batch finishes. This mechanism makes the other workers, which might have finished the jobs or simulations earlier, wait for the last worker to finish, thus creating an unnecessary idle period. The asynchronous batch parallel scheme is, therefore, created to accelerate the optimization process by immediately assigning the next jobs to workers that have finished earlier jobs, without waiting for each other. When workers finish one query, the objective GP is updated, and the next sampling point is found by maximizing the acquisition function. Numerical comparison results are shown in one of our previous works [TEW+22], across a number of numerical functions and some engineering simulations as well.