Surrogate Models
This chapter deals with the theory behind Dakota’s surrogate models, which are also known as response surfaces and meta-models.
Kriging and Gaussian Process Models
In this discussion of Kriging and Gaussian Process (GP) models, vectors
are indicated by a single underline and matrices are indicated by a
double underline. Capital letters indicate data, or functions of data,
that are used to construct an emulator. Lower case letters indicate
arbitrary points, i.e. points where the simulator may or may not have
been evaluated, and functions of arbitrary points. Estimates,
approximations, and models are indicated by hats. For instance,
$\hat{f}\left(\underline{x}\right)$ denotes an emulator of the true function $f\left(\underline{x}\right)$, and $\hat{y}$ denotes an estimate or prediction of the true response $y$.
Kriging & Gaussian Processes: Function Values Only
The set of interpolation techniques known as Kriging, also referred to as Gaussian Processes, was originally developed in the geostatistics and spatial statistics communities to produce maps of underground geologic deposits based on a set of widely and irregularly spaced borehole sites [Cre91]. Building a Kriging model typically involves the

1. choice of a trend function,
2. choice of a correlation function, and
3. estimation of correlation parameters.
A Kriging emulator, $\hat{f}\left(\underline{x}\right)$, consists of a trend function (frequently a least squares fit to the data, $\underline{g}\left(\underline{x}\right)^T\,\underline{\hat{\beta}}$) plus a Gaussian process error model, $\epsilon\left(\underline{x}\right)$, that is used to correct the trend function:

$$\hat{f}\left(\underline{x}\right) = \underline{g}\left(\underline{x}\right)^T\,\underline{\hat{\beta}} + \epsilon\left(\underline{x}\right).$$

This represents an estimated distribution for the unknown true surface, $f\left(\underline{x}\right)$. The error model, $\epsilon\left(\underline{x}\right)$, makes an adjustment to the trend function so that the emulator interpolates, and has zero uncertainty at, the data points from which it was built. The covariance between the errors at two arbitrary points is modeled as

$$\mathrm{Cov}\left(\epsilon\left(\underline{x}\right),\epsilon\left(\underline{x}'\right)\right) = \sigma^2\, r\left(\underline{x},\underline{x}'\right),$$

where $\sigma^2$ is known as the unadjusted variance and $r\left(\underline{x},\underline{x}'\right)$ is a correlation function that depends on a vector of correlation parameters, $\underline{\theta}$.
By convention, the terms simple Kriging, ordinary Kriging, and universal
Kriging are used to indicate the three most common choices for the trend
function. In simple Kriging, the trend is treated as a known constant,
usually zero. In ordinary Kriging, the trend is an unknown constant, so $g\left(\underline{x}\right) = 1$ and $\hat{\beta}$ is a scalar estimated from the data. In universal Kriging, the trend is a linear combination of known basis functions, typically low-order polynomials, whose coefficients $\underline{\hat{\beta}}$ are estimated from the data.
For a finite number of sample points, $N$, the correlations between the GP errors at the sample points are collected in the $N \times N$ correlation matrix, $\underline{\underline{R}}$, with elements $R_{i,j} = r\left(\underline{X}_i,\underline{X}_j\right)$. Here $\underline{X}_i$ denotes the $i$-th sample point.
There are many options for the correlation function, $r\left(\underline{x},\underline{x}'\right)$; three popular families are given below.
Powered-Exponential

$$r\left(\underline{x},\underline{x}'\right) = \exp\left(-\sum_{k=1}^{M}\theta_k\left|x_k - x'_k\right|^{\gamma}\right) \tag{214}$$

where $0 < \gamma \le 2$ and $0 < \theta_k$.

Matern

$$r\left(\underline{x},\underline{x}'\right) = \prod_{k=1}^{M}\frac{2^{1-\nu}}{\Gamma\left(\nu\right)}\left(\theta_k\left|x_k - x'_k\right|\right)^{\nu} K_{\nu}\left(\theta_k\left|x_k - x'_k\right|\right)$$

where $0 < \nu$, $0 < \theta_k$, and $K_{\nu}\left(\cdot\right)$ is the modified Bessel function of order $\nu$; $\nu = s + \frac{1}{2}$ is the smallest value of $\nu$ which results in a Kriging model that is $s$ times differentiable.

Cauchy

$$r\left(\underline{x},\underline{x}'\right) = \prod_{k=1}^{M}\left(1 + \theta_k\left|x_k - x'_k\right|^{\gamma}\right)^{-\nu}$$

where $0 < \gamma \le 2$, $0 < \nu$, and $0 < \theta_k$.
Gneiting et al. [GGG07] provide a more thorough discussion of the properties of and relationships between these three families. Some additional correlation functions include the Dagum family [BMP08] and cubic splines.
The squared exponential or Gaussian correlation function
(Equation (214) with $\gamma = 2$) is infinitely differentiable and therefore produces very smooth emulators. For the Gaussian correlation function, the correlation parameters, $\theta_k$, can be expressed in terms of correlation lengths, $L_k$, as

$$\theta_k = \frac{1}{2\,L_k^2}.$$

Here, the correlation lengths, $\underline{L}$, are analogous to the standard deviation in the Gaussian or normal distribution and often have physical meaning: a small $L_k$ indicates that the response changes rapidly with the $k$-th input, while a large $L_k$ indicates slow variation.
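To make the construction concrete, the following numpy sketch (an illustrative helper, not Dakota code) assembles the Gaussian correlation matrix $\underline{\underline{R}}$ from a sample design and a vector of correlation lengths.

```python
import numpy as np

def gaussian_correlation_matrix(X, corr_lengths):
    """Gaussian (squared exponential) correlation matrix R.

    X            : (N, M) array of sample points
    corr_lengths : (M,) array of correlation lengths L_k
    Uses theta_k = 1 / (2 L_k^2), so
    R_ij = exp(-sum_k theta_k * (X_ik - X_jk)^2).
    """
    theta = 1.0 / (2.0 * np.asarray(corr_lengths) ** 2)
    diff = X[:, None, :] - X[None, :, :]          # (N, N, M) pairwise differences
    return np.exp(-np.einsum("ijk,k->ij", diff**2, theta))

# Example: 5 points in 2 dimensions
X = np.random.default_rng(0).uniform(size=(5, 2))
R = gaussian_correlation_matrix(X, corr_lengths=[0.3, 0.5])
print(np.diag(R))  # all ones, consistent with trace(R) = N
```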
Ill-conditioning of the correlation matrix, $\underline{\underline{R}}$, is a recognized challenge when constructing Kriging models. In the context of estimating the optimal correlation parameters, $\underline{\theta}$, candidate values of $\underline{\theta}$ can make $\underline{\underline{R}}$ nearly singular, and several strategies have been proposed to handle this.
Rennen [Ren09] advocated that ill-conditioning be handled by building Kriging models from a uniform subset of available sample points. That option has been available in Dakota’s “Gaussian process” model (a separate implementation from Dakota’s “Kriging” model) since version 4.1 [EAG+07]. Note that Kriging/Gaussian-Process models will not exactly interpolate the discarded points. The implicit justification for this type of approach is that the rows or columns of an ill-conditioned matrix contain a significant amount of duplicate information, and that the discarded duplicate information should therefore be easy to predict.
Dakota’s surfpack model has a similar “discard near duplicate points” capability. However, it explicitly addresses the issue of unique information content. Points are not discarded prior to the construction of the Kriging model. Instead, for each vector of correlation parameters, $\underline{\theta}$, examined during the construction of the model, the points whose equations contribute the least unique information to an ill-conditioned $\underline{\underline{R}}$ are identified and discarded, so that the emulator exactly interpolates the retained points.
Adding a nugget, $\eta$, to the diagonal entries of $\underline{\underline{R}}$ (i.e., replacing $\underline{\underline{R}}$ with $\underline{\underline{R}} + \eta\,\underline{\underline{I}}$) is a popular approach for both modeling measurement error in the data and alleviating ill-conditioning; however, it means the Kriging model no longer exactly interpolates its build points. Common strategies for choosing the nugget include:

1. Choosing a nugget based on the variance of measurement error (if any); this will be an iterative process if $\sigma^2$ is not known in advance.
2. Iteratively adding a successively larger nugget until $\underline{\underline{R}} + \eta\,\underline{\underline{I}}$ is no longer ill-conditioned.
3. Exactly calculating the minimum nugget needed for a target condition number from $\underline{\underline{R}}$'s maximum and minimum eigenvalues. The condition number of $\underline{\underline{R}} + \eta\,\underline{\underline{I}}$ is $\frac{\lambda_{\max} + \eta}{\lambda_{\min} + \eta}$. However, calculating eigenvalues is computationally expensive. Since Kriging's $\underline{\underline{R}}$ matrix has all ones on the diagonal, its trace and therefore sum of eigenvalues is $N$. Consequently, a nugget value of $\eta = \frac{N}{\text{target condition number} - 1}$ will always alleviate ill-conditioning. A smaller nugget that is also guaranteed to alleviate ill-conditioning can be calculated from LAPACK's fast estimate of the reciprocal of $\underline{\underline{R}}$'s condition number, rcond (a small numerical sketch of this option follows the list).
4. Treating $\eta$ as another parameter to be selected by the same process used to choose $\underline{\theta}$. Two such approaches are discussed below.
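As a concrete sketch of option 3 (illustrative only; the helper name and target condition number are assumptions, and the exact condition number is computed here where Dakota relies on LAPACK's cheap rcond estimate):

```python
import numpy as np

def nugget_for_target_condition(N, target_cond=1e12):
    """Smallest nugget guaranteed, using only trace(R) = N, to give
    cond(R + eta*I) <= target_cond.

    The correlation matrix R has unit diagonal, so lambda_max <= N and
    lambda_min >= 0; hence cond(R + eta*I) <= (N + eta)/eta, which is at
    most target_cond whenever eta >= N / (target_cond - 1).
    """
    return N / (target_cond - 1.0)

# Example: a nearly singular 50-point Gaussian correlation matrix
N = 50
rng = np.random.default_rng(1)
X = np.repeat(rng.uniform(size=(25, 2)), 2, axis=0)   # duplicated points
X += rng.normal(scale=1e-8, size=X.shape)             # ... made near-duplicates
diff = X[:, None, :] - X[None, :, :]
R = np.exp(-np.sum(diff**2, axis=-1) / (2 * 0.3**2))

eta = nugget_for_target_condition(N, target_cond=1e10)
print(np.linalg.cond(R), np.linalg.cond(R + eta * np.eye(N)))  # before / after
```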
The Kriging model’s adjusted variance is commonly used as a spatially varying measure of uncertainty. Knowing where, and by how much, the model “doubts” its own predictions helps build user confidence in the predictions and can be utilized to guide the selection of new sample points during optimization or to otherwise improve the surrogate. The adjusted variance is

$$\hat{\sigma}^2_{\mathrm{adj}}\left(\underline{x}\right) = \hat{\sigma}^2\left[1 - \underline{r}\left(\underline{x}\right)^T\,\underline{\underline{R}}^{-1}\,\underline{r}\left(\underline{x}\right) + \left(\underline{g}\left(\underline{x}\right) - \underline{\underline{G}}^T\,\underline{\underline{R}}^{-1}\,\underline{r}\left(\underline{x}\right)\right)^T\left(\underline{\underline{G}}^T\,\underline{\underline{R}}^{-1}\,\underline{\underline{G}}\right)^{-1}\left(\underline{g}\left(\underline{x}\right) - \underline{\underline{G}}^T\,\underline{\underline{R}}^{-1}\,\underline{r}\left(\underline{x}\right)\right)\right],$$

where the maximum likelihood estimate of the unadjusted variance is

$$\hat{\sigma}^2 = \frac{\left(\underline{Y} - \underline{\underline{G}}\,\underline{\hat{\beta}}\right)^T\,\underline{\underline{R}}^{-1}\left(\underline{Y} - \underline{\underline{G}}\,\underline{\hat{\beta}}\right)}{N}.$$

Here $\underline{r}\left(\underline{x}\right)$ is the $N$-vector of correlations between $\underline{x}$ and the sample points, $\underline{Y}$ is the vector of observed responses, $\underline{\underline{G}}$ is the $N \times N_{\beta}$ matrix whose rows contain the trend basis functions evaluated at the sample points, and $\underline{\hat{\beta}}$ is the generalized least squares estimate of the trend coefficients.
There are two types of numerical approaches to choosing $\underline{\theta}$. One is to use Bayesian techniques, such as Markov Chain Monte Carlo sampling, to obtain a distribution of $\underline{\theta}$ vectors rather than a single optimal value.
The other, more common, approach to constructing a Kriging model involves using optimization to find the set of correlation parameters, $\underline{\theta}$, that maximizes the likelihood of the model given the data. Dakota’s gaussian_process and surfpack models use the maximum likelihood approach. It is equivalent, and more convenient, to maximize the natural logarithm of the likelihood which, assuming a vague prior, is a function of $\underline{\theta}$ and $\sigma^2$. If one substitutes the maximum likelihood estimate, $\hat{\sigma}^2$, in place of $\sigma^2$, then maximizing the log-likelihood is equivalent to minimizing the compact objective function

$$\mathrm{obj}\left(\underline{\theta}\right) = \log\left(\hat{\sigma}^2\right) + \frac{\log\left(\det\left(\underline{\underline{R}}\right)\right) + \log\left(\det\left(\underline{\underline{G}}^T\,\underline{\underline{R}}^{-1}\,\underline{\underline{G}}\right)\right)}{N - N_{\beta}},$$

where $N_{\beta}$ is the number of trend basis functions. Because of the division by $N - N_{\beta}$, this “per equation” objective is largely insensitive to the number of sample points, which makes it useful for comparing Kriging models built from different sample designs. Note that the determinants of $\underline{\underline{R}}$ and $\underline{\underline{G}}^T\,\underline{\underline{R}}^{-1}\,\underline{\underline{G}}$ can be computed efficiently and accurately from their Cholesky factorizations; working with the logarithm of the determinant (twice the sum of the logs of the Cholesky diagonals) avoids the underflow or overflow that forming the determinant directly would cause for large $N$.
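For example, a short sketch of the log-determinant computation from a Cholesky factor, which never forms the determinant itself:

```python
import numpy as np

def log_det_spd(A):
    """log(det(A)) for a symmetric positive definite matrix, computed from
    the Cholesky factor so the determinant itself is never formed."""
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))
```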
Also note that, in the absence of constraints, maximizing the likelihood would result in a singular $\underline{\underline{R}}$, which makes the emulator incapable of reproducing the data from which it was built. To prevent this, Dakota’s surfpack models impose two types of constraints during the maximum likelihood optimization.
The first of these is an explicit constraint on LAPACK’s fast estimate
of the (reciprocal of the) condition number, rcond$\left(\underline{\underline{R}}\right)$: a candidate $\underline{\theta}$ is considered feasible only if rcond$\left(\underline{\underline{R}}\right)$ exceeds a specified threshold.
The second is a box constraint defining a small “feasible” region in
correlation length space to search during the maximum likelihood
optimization. Many global optimizers, such as the DIRECT (DIvision of
RECTangles) algorithm used by Dakota’s Gaussian Process (as the only option) and
Kriging (as the default option) models, require a box constraint
definition for the range of acceptable parameters. By default, Dakota’s
surfpack
model defines the input space to be the smallest
hyper-rectangle that contains the sample design. The user has the option
to define a larger input space that includes a region where they wish to
extrapolate. Note that the emulator can be evaluated at points outside
the defined input space, but this definition helps to determine the
extent of the “feasible” region of correlation lengths. Let the input
space be normalized to the unit hypercube centered at the origin. The
average distance between nearest neighboring points is then

$$d = \left(\frac{1}{N}\right)^{1/M}.$$

Dakota’s “feasible” range of correlation lengths extends from a small fraction of $d$ up to a moderate multiple of $d$. This range was chosen based on correlation lengths being analogous to the standard deviation in the Gaussian or Normal distribution. If the correlation lengths are set to a small fraction of $d$, even nearest neighboring points are several “standard deviations” apart and therefore essentially uncorrelated, so the data cannot support shorter correlation lengths; if they are set to several times $d$, the emulator varies little over the entire sample design, so longer correlation lengths add little flexibility.
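A minimal sketch of how such a box constraint on correlation lengths could be assembled from a sample design; the lower and upper multipliers here are illustrative placeholders, not Dakota’s defaults.

```python
import numpy as np

def correlation_length_box(X, lower_mult=0.25, upper_mult=8.0):
    """Box constraint on correlation lengths for an N x M sample design X.

    d = (1/N)**(1/M) is the average nearest-neighbor spacing of N points
    in the M-dimensional unit hypercube; the multipliers are illustrative,
    not Dakota's defaults.
    """
    N, M = X.shape
    span = X.max(axis=0) - X.min(axis=0)   # per-dimension extent of the design
    d = (1.0 / N) ** (1.0 / M)             # spacing in normalized coordinates
    lower = lower_mult * d * span           # map back to original units
    upper = upper_mult * d * span
    return lower, upper
```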
Gradient Enhanced Kriging
This section focuses on the incorporation of derivative information into Kriging models and challenges in their implementation. Special attention is paid to conditioning issues.
There are at least three basic approaches for incorporating derivative information into Kriging. These are

1. Indirect: The sample design is augmented with fictitious points nearby actual sample points which are predicted from derivative information, and then a Kriging model is built from the augmented design.
2. Co-Kriging: The derivatives with respect to each input variable are treated as separate but correlated output variables and a Co-Kriging model is built for the set of output variables. This would use multiple $\underline{\theta}$ vectors, one for each output plus those needed for the cross-correlations between outputs.
3. Direct: The relationship between the response value and its derivatives is leveraged to use a single set of correlation parameters, by assuming that the derivative of the emulator is the emulator of the derivative,

$$\frac{\partial \hat{f}\left(\underline{x}\right)}{\partial x_k} = \widehat{\frac{\partial f\left(\underline{x}\right)}{\partial x_k}}. \tag{218}$$
Dakota includes an implementation of the direct approach,
herein referred to simply as Gradient Enhanced (universal) Kriging
(GEK). The equations for GEK can be derived by assuming Equation
(218) and then taking the same
steps used to derive function value only Kriging. The superscript $\nabla$ on a quantity indicates its gradient-enhanced version; for example, $\underline{\underline{R}}^{\nabla}$ is the GEK correlation matrix, whose blocks contain the correlations of function values with function values, of function values with derivatives, and of derivatives with derivatives. Here capital letters again denote quantities constructed from the data, and each derivative block of $\underline{\underline{R}}^{\nabla}$ involves first or second derivatives of the correlation function and therefore has units of inverse length or inverse length squared relative to the function value block.
A straightforward implementation of GEK tends to be significantly more accurate than Kriging given the same sample design, provided that the

- derivatives are accurate,
- derivatives are not infinite (or nearly so),
- function is sufficiently smooth, and
- $\underline{\underline{R}}^{\nabla}$ is not ill-conditioned (this can be problematic).
If gradients can be obtained cheaply (e.g. by automatic differentiation
or adjoint techniques) and the previous conditions are met, GEK also
tends to outperform Kriging for the same computational budget. Previous
works, such as Dwight [DH09], state that the direct
approach to GEK is significantly better conditioned than the indirect
approach. While this is true, (direct) GEK’s correlation matrix, $\underline{\underline{R}}^{\nabla}$, can still be, and often is, ill-conditioned.
In the literature, ill-conditioning is often attributed to the choice of the correlation function. Although a different correlation function may alleviate the ill-conditioning for some problems, the root cause of the ill-conditioning is a poorly spaced sample design. Furthermore, a sufficiently bad sample design could make any interpolatory Kriging model, gradient enhanced or otherwise, ill-conditioned, regardless of the choice of correlation function. This root cause can be addressed directly by discarding points/equations.
Discarding points/equations is conceptually similar to using a
Moore-Penrose pseudo inverse of $\underline{\underline{R}}^{\nabla}$. However, a pseudo inverse keeps all of the most duplicated information (the largest singular values) and discards the barely present information (the smallest singular values).
An alternate strategy is to discard additional copies of the information
that is most duplicated and keep more of the barely present information.
In the context of eigenvalues, this can be described as decreasing the
maximum eigenvalue and increasing the minimum eigenvalue by a smaller
amount than a pseudo inverse. The result is that the GEK model will
exactly fit all of the retained information. This can be achieved using
a pivoted Cholesky factorization, such as the one developed by Lucas
[Luc04], to determine a reordering of the equations such that those containing the most new information come first and near duplicates come last.
In benchmarking tests, Lucas’ level 3 pivoted Cholesky implementation was not competitive with the level 3 LAPACK non-pivoted Cholesky in terms of computational efficiency. In some cases, it was an order of magnitude slower. Note that Lucas’ level 3 implementation can default to his level 2 implementation and this may explain some of the loss in performance.
More importantly, applying pivoted Cholesky directly to $\underline{\underline{R}}^{\nabla}$ reorders individual equations without regard to which point they belong to, so a function value can be retained while some of the derivatives at the same point are discarded (or vice versa), which is undesirable.

Fig. 82 A diagram with pseudo code for the pivoted Cholesky algorithm used to
select the subset of equations to retain when $\underline{\underline{R}}^{\nabla}$ is ill-conditioned.
To address computational efficiency and robustness, Dakota’s pivoted Cholesky approach for GEK was modified to:
1. Equilibrate $\underline{\underline{R}}^{\nabla}$ to improve the accuracy of the Cholesky factorization; this is beneficial because $\underline{\underline{R}}^{\nabla}$ can be poorly scaled. Theorem 4.1 of van der Sluis [vdS69] states that if $\underline{\underline{A}}$ is a real, symmetric, positive definite $n$ by $n$ matrix and the diagonal matrix $\underline{\underline{D}}$ contains the square roots of the diagonals of $\underline{\underline{A}}$, then the equilibration

$$\underline{\underline{\tilde{A}}} = \underline{\underline{D}}^{-1}\,\underline{\underline{A}}\,\underline{\underline{D}}^{-1}$$

minimizes the 2-norm condition number of $\underline{\underline{\tilde{A}}}$ (with respect to solving linear systems) over all such symmetric scalings, to within a factor of $n$. The equilibrated matrix $\underline{\underline{\tilde{A}}}$ will have all ones on the diagonal.
2. Perform pivoted Cholesky on $\underline{\underline{R}}$, instead of $\underline{\underline{R}}^{\nabla}$, to rank points according to how much new information they contain. This ranking is reflected by the resulting reordering of the points.
3. Apply the ordering of points to whole points in $\underline{\underline{R}}^{\nabla}$ to produce a reordered GEK correlation matrix. Here a whole point means the function value at a point immediately followed by the derivatives at the same point.
4. Perform a LAPACK non-pivoted Cholesky on the equilibrated, reordered $\underline{\underline{R}}^{\nabla}$ and drop equations off the end until it satisfies the constraint on rcond. LAPACK's rcond estimate requires the 1-norm of the original (reordered) matrix as input, so the 1-norms for all possible sizes of the retained matrix are precomputed (using a rank one update approach) and stored prior to the Cholesky factorization. A bisection search is used to efficiently determine the number of equations that need to be retained/discarded. This requires $\log_2\left(\left(1+M\right)N\right)$ or fewer evaluations of rcond. These rcond calls are all based off the same Cholesky factorization but use different numbers of rows/columns.
This algorithm is visually depicted in
Fig. 82. Because inverting/factorizing a
matrix with $n$ rows and columns requires $\mathcal{O}\left(n^3\right)$ operations, performing the pivoted Cholesky on the $N \times N$ matrix $\underline{\underline{R}}$, rather than on the $\left(1+M\right)N \times \left(1+M\right)N$ matrix $\underline{\underline{R}}^{\nabla}$, is substantially cheaper.
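The following simplified numpy sketch illustrates the point-ranking and whole-point retention idea; the function names are hypothetical, and it uses an exact condition number and a greedy loop where Dakota's implementation uses equilibration, LAPACK's rcond estimate, and a bisection search. It assumes the GEK matrix is laid out in whole-point blocks (function value followed by the M derivatives of that point).

```python
import numpy as np

def pivoted_cholesky_order(R):
    """Greedy pivoted Cholesky of the N x N correlation matrix R.

    Returns a point ordering: points carrying the most new information
    first, near-duplicate points last (sketch of step 2 above).
    """
    A = R.astype(float).copy()
    N = A.shape[0]
    order = np.arange(N)
    for k in range(N):
        # pivot: move the largest remaining (Schur complement) diagonal to k
        p = k + int(np.argmax(np.diag(A)[k:]))
        A[[k, p], :] = A[[p, k], :]
        A[:, [k, p]] = A[:, [p, k]]
        order[[k, p]] = order[[p, k]]
        piv = A[k, k]
        if piv <= 1e-16:
            break
        A[k, k:] /= np.sqrt(piv)
        # downdate the trailing submatrix (Schur complement)
        A[k + 1:, k + 1:] -= np.outer(A[k, k + 1:], A[k, k + 1:])
    return order

def retain_whole_points(R_grad, order, M, target_cond=1e10):
    """Keep leading whole points (value + M derivatives) of the reordered
    GEK matrix until the condition number exceeds the target (sketch of
    steps 3-4; Dakota uses LAPACK's rcond and a bisection search)."""
    keep = []
    for p in order:
        block = list(range(p * (1 + M), (p + 1) * (1 + M)))
        trial = keep + block
        if np.linalg.cond(R_grad[np.ix_(trial, trial)]) > target_cond:
            break
        keep = trial
    return keep
```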
Experimental Gaussian Process
The experimental semi-parametric Gaussian process (GP) regression model
in Dakota’s surrogates module contains two types of variables: kernel
hyperparameters and, optionally, coefficients of a polynomial trend function.
We consider an anisotropic squared exponential kernel

$$k\left(\underline{x},\underline{x}'\right) = \sigma^2\exp\left(-\frac{1}{2}\sum_{k=1}^{M}\frac{\left(x_k - x'_k\right)^2}{\ell_k^2}\right).$$

The parameter $\sigma$ scales the overall variance of the GP and the $\ell_k$ are the anisotropic length scales for each of the $M$ inputs. A nugget parameter, $\eta$, may also be included; it adds $\eta^2$ to the diagonal of the Gram matrix, models noise in the data, and improves conditioning.
As of Dakota 6.14, the Matérn kernel family is also supported.
All of the kernel hyperparameters are strictly positive except for the
nugget, which is non-negative and will sometimes be omitted or set to a
fixed value. These parameters may vary over several orders of magnitude,
such that variable scaling can have a significant effect on the
performance of the optimizer used to find them. We therefore optimize over the log-transformed hyperparameters

$$\tilde{\theta}_i = \log\left(\theta_i\right),$$

with $\underline{\theta}$ collecting the kernel variance, the length scales, and (when present) the nugget, where the hyperparameters are ordered so that the log-transformed variables form a single vector $\underline{\tilde{\theta}}$ over which the optimizer searches.
An important quantity in Gaussian process regression is the Gram
matrix, $\underline{\underline{K}}$, with entries

$$K_{ij} = k\left(\underline{x}_i,\underline{x}_j\right),$$

where, if the layout of the surrogate build samples matrix is $N$ rows (samples) by $M$ columns (variables), $\underline{x}_i$ denotes the $i$-th build point and $\underline{\underline{K}}$ is $N \times N$.
The derivatives (shown only for the squared exponential and white/nugget
kernels here) of the Gram matrix with respect to the hyperparameters are
used in the regression procedure. To compute them it is helpful to
partition the Gram matrix into two components, one that depends on the
squared exponential piece of the kernel and one that depends on the nugget:

$$\underline{\underline{K}} = \underline{\underline{K}}_{\mathrm{SE}} + \eta^2\,\underline{\underline{I}}.$$

Let $\underline{\underline{D}}^{(k)}$ denote the matrix of pairwise differences in the $k$-th input dimension, $D^{(k)}_{ij} = x_{i,k} - x_{j,k}$. The derivatives of $\underline{\underline{K}}$ with respect to the log-transformed hyperparameters can then be written in terms of element-wise products of $\underline{\underline{K}}_{\mathrm{SE}}$ with functions of the $\underline{\underline{D}}^{(k)}$, for example

$$\frac{\partial \underline{\underline{K}}}{\partial \log \ell_k} = \underline{\underline{K}}_{\mathrm{SE}} \circ \frac{\underline{\underline{D}}^{(k)} \circ \underline{\underline{D}}^{(k)}}{\ell_k^2}, \qquad \frac{\partial \underline{\underline{K}}}{\partial \log \sigma} = 2\,\underline{\underline{K}}_{\mathrm{SE}}, \qquad \frac{\partial \underline{\underline{K}}}{\partial \log \eta} = 2\,\eta^2\,\underline{\underline{I}},$$

where $\circ$ denotes the element-wise (Hadamard) product.
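The following numpy sketch (hypothetical helper names, not the Dakota surrogates API) assembles the Gram matrix for the anisotropic squared exponential kernel with a nugget and one of the derivative matrices described above.

```python
import numpy as np

def gram_matrix(X, sigma, ell, eta):
    """K = K_SE + eta^2 * I for the anisotropic squared exponential kernel.

    X : (N, M) build samples; sigma, eta : scalars; ell : (M,) length scales.
    """
    D = X[:, None, :] - X[None, :, :]                    # (N, N, M) differences
    K_se = sigma**2 * np.exp(-0.5 * np.sum((D / ell)**2, axis=-1))
    return K_se + eta**2 * np.eye(X.shape[0]), K_se, D

def dK_dlog_ell(K_se, D, ell, k):
    """Derivative of the Gram matrix w.r.t. log(ell_k):
    K_SE (Hadamard) D_k^2 / ell_k^2; the nugget term does not contribute."""
    return K_se * (D[:, :, k]**2) / ell[k]**2

# Example
X = np.random.default_rng(0).uniform(size=(10, 3))
ell = np.array([0.5, 1.0, 2.0])
K, K_se, D = gram_matrix(X, sigma=1.2, ell=ell, eta=1e-3)
dK0 = dK_dlog_ell(K_se, D, ell, k=0)
```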
A polynomial trend function is obtained by multiplying a basis matrix
for the samples, $\underline{\underline{B}}$ (each row contains the polynomial basis functions evaluated at a build point), by a vector of trend coefficients, $\underline{\beta}$, so that the prior mean of the GP is $\underline{\underline{B}}\,\underline{\beta}$.
The gradient of the objective function (the negative log marginal likelihood) can be computed analytically using the derivatives of the Gram matrix. First we introduce the vector

$$\underline{\alpha} = \underline{\underline{K}}^{-1}\left(\underline{y} - \underline{\underline{B}}\,\underline{\beta}\right).$$

The gradient with respect to a log-transformed hyperparameter $\tilde{\theta}_j$ is then

$$\frac{\partial\left(-\log\mathcal{L}\right)}{\partial\tilde{\theta}_j} = \frac{1}{2}\,\mathrm{tr}\left[\left(\underline{\underline{K}}^{-1} - \underline{\alpha}\,\underline{\alpha}^T\right)\frac{\partial\underline{\underline{K}}}{\partial\tilde{\theta}_j}\right].$$
We use the Rapid Optimization Library’s [KRvBWvW14]
gradient-based, bound-constrained implementation of the optimization
algorithm L-BFGS-B [BLNZ95] to minimize the objective function with respect to the log-transformed hyperparameters.
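Dakota uses ROL's implementation; purely as an illustration of the same objective-plus-gradient pattern, the following self-contained sketch uses scipy's L-BFGS-B on a toy 1D problem (the data, bounds, and starting values are illustrative assumptions, and the trend is omitted).

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: noisy samples of a smooth 1D function
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 1e-3 * rng.normal(size=30)

def neg_log_marginal_likelihood(log_params):
    """Negative log marginal likelihood and its gradient for a 1D squared
    exponential kernel with hyperparameters (log sigma, log ell, log eta)."""
    sigma, ell, eta = np.exp(log_params)
    D = X[:, 0][:, None] - X[:, 0][None, :]
    K_se = sigma**2 * np.exp(-0.5 * D**2 / ell**2)
    K = K_se + eta**2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    # 0.5*y'K^-1 y + 0.5*log|K| + const, with log|K| from Cholesky diagonals
    nll = 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)
    # dK/d(log theta_j) for each hyperparameter, in the same order as log_params
    dKs = [2 * K_se, K_se * D**2 / ell**2, 2 * eta**2 * np.eye(len(y))]
    W = np.linalg.inv(K) - np.outer(alpha, alpha)
    grad = np.array([0.5 * np.sum(W * dK.T) for dK in dKs])
    return nll, grad

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.5, 0.1]),
               jac=True, method="L-BFGS-B",
               bounds=[(-5, 5), (-5, 2), (-10, 1)])
print(np.exp(res.x))  # fitted sigma, ell, eta
```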
Once the regression problem has been solved, we may wish to evaluate the
GP at a set of prediction points, $\underline{\underline{X}}^*$. Let $\underline{\underline{K}}^*$ denote the cross-covariance matrix between the build points and the prediction points, $\underline{\underline{K}}^{**}$ the covariance matrix among the prediction points, and $\underline{\underline{B}}^*$ the polynomial basis matrix evaluated at the prediction points. The mean and covariance of the GP at the prediction points are

$$\underline{\mu}^* = \underline{\underline{B}}^*\,\underline{\beta} + \left(\underline{\underline{K}}^*\right)^T\,\underline{\underline{K}}^{-1}\left(\underline{y} - \underline{\underline{B}}\,\underline{\beta}\right),$$

$$\underline{\underline{\Sigma}}^* = \underline{\underline{K}}^{**} - \left(\underline{\underline{K}}^*\right)^T\,\underline{\underline{K}}^{-1}\,\underline{\underline{K}}^*.$$
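A minimal sketch of these prediction formulas (trend omitted, i.e. $\underline{\underline{B}} = 0$; the helper names are hypothetical):

```python
import numpy as np

def gp_predict(X, y, Xstar, sigma, ell, eta):
    """Posterior mean and covariance at prediction points Xstar for a
    zero-trend GP with the anisotropic squared exponential kernel."""
    def k(A, B):
        D = A[:, None, :] - B[None, :, :]
        return sigma**2 * np.exp(-0.5 * np.sum((D / ell)**2, axis=-1))

    K = k(X, X) + eta**2 * np.eye(len(X))
    Ks = k(X, Xstar)                      # cross-covariance, build x prediction
    Kss = k(Xstar, Xstar)
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

# Example
rng = np.random.default_rng(3)
X = rng.uniform(size=(15, 2))
y = np.sin(4 * X[:, 0]) * np.cos(3 * X[:, 1])
mu, cov = gp_predict(X, y, Xstar=rng.uniform(size=(5, 2)),
                     sigma=1.0, ell=np.array([0.3, 0.3]), eta=1e-4)
```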
Polynomial Models
A preliminary discussion of the surrogate polynomial models available in
Dakota is presented in the main Surrogate Models section,
with select details reproduced here. For ease of notation, the discussion
in this section assumes that the model returns a single response.
For each point $\underline{x}$ in an $M$-dimensional parameter space,
a quadratic polynomial model is

$$\hat{f}\left(\underline{x}\right) \approx c_0 + \sum_{i=1}^{M} c_i x_i + \sum_{i=1}^{M}\sum_{j \ge i}^{M} c_{ij}\, x_i x_j,$$

and a cubic polynomial model is

$$\hat{f}\left(\underline{x}\right) \approx c_0 + \sum_{i=1}^{M} c_i x_i + \sum_{i=1}^{M}\sum_{j \ge i}^{M} c_{ij}\, x_i x_j + \sum_{i=1}^{M}\sum_{j \ge i}^{M}\sum_{k \ge j}^{M} c_{ijk}\, x_i x_j x_k.$$

In these equations, $c_0$, $c_i$, $c_{ij}$, and $c_{ijk}$ are the polynomial coefficients determined during the fitting process.
Let the matrix $\underline{\underline{X}}$ contain one row per sample point, with columns given by the polynomial basis terms evaluated at that point, and let $\underline{f}$ be the vector of observed responses. The coefficients are obtained by solving the least squares problem $\underline{\underline{X}}\,\underline{c} \approx \underline{f}$,

$$\underline{c} = \left(\underline{\underline{X}}^T\,\underline{\underline{X}}\right)^{-1}\underline{\underline{X}}^T\,\underline{f},$$

where the number of sample points should equal or exceed the number of coefficients so that the system is determined or over-determined.
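As a small illustration of this least squares fit (the helper name and data are hypothetical), a full quadratic basis can be assembled and the coefficients recovered with numpy:

```python
import numpy as np

def quadratic_basis(X):
    """Rows of the least squares matrix for a full quadratic polynomial:
    [1, x_1..x_M, x_i*x_j for j >= i]."""
    N, M = X.shape
    cols = [np.ones(N)] + [X[:, i] for i in range(M)]
    cols += [X[:, i] * X[:, j] for i in range(M) for j in range(i, M)]
    return np.column_stack(cols)

# Fit: coefficients c minimizing ||X_basis c - f||_2
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 2))
f = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
c, *_ = np.linalg.lstsq(quadratic_basis(X), f, rcond=None)
print(c.round(3))  # recovers [1, 2, -1, 0, 0.5, 0] for this noise-free example
```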
Additional discussion and detail can be found in [NWK85].