This chapter introduces several fundamental concepts related to sampling
methods. In particular, the statistical properties of the Monte Carlo (MC)
estimator are discussed (see Monte Carlo (MC)), and
strategies for multilevel and multifidelity sampling are introduced
within this context. Hereafter, multilevel refers to the possibility of
exploiting distinct discretization levels (i.e. space/time resolution)
within a single model form, whereas multifidelity involves the use of
more than one model form. In Multifidelity Monte Carlo,
we describe the multifidelity Monte Carlo and its single fidelity model version, the control variate Monte Carlo,
that we align with
multifidelity sampling, and in Multilevel Monte Carlo, we
describe the multilevel Monte Carlo algorithm that we align with
multilevel sampling. In A multilevel-multifidelity approach, we show that
these two approaches can be combined to create multilevel-multifidelity
sampling approaches. Finally, this chapter discusses Quasi-Monte Carlo (QMC)
or low-discrepancy sampling.
Monte Carlo is a popular algorithm for stochastic simulations due to its
simplicity, flexibility, and the provably convergent behavior that is
independent of the number of input uncertainties. A quantity of interest
$Q$, represented as a random variable (RV), can be introduced as a function of a random vector
$\boldsymbol{\xi} \in \mathbb{R}^{d}$. The goal of any
MC simulation is computing statistics for $Q$, e.g. the expected
value $\mathbb{E}\left[ Q \right]$. The MC estimator
for $\mathbb{E}\left[ Q \right]$ is defined
as follows

\[ \hat{Q}^{MC}_{N} = \frac{1}{N} \sum_{i=1}^{N} Q\!\left( \boldsymbol{\xi}^{(i)} \right), \qquad (37) \]

where $Q^{(i)} = Q\!\left( \boldsymbol{\xi}^{(i)} \right)$ and $N$ is used
to indicate the number of realizations of the model.
The MC estimator is unbiased, i.e., its bias is zero, and its convergence to the true
statistics is $\mathcal{O}\!\left( N^{-1/2} \right)$. Moreover, since each
set of realizations for $Q^{(i)}$ is different, another crucial property of any
estimator is its own variance:

\[ \mathbb{V}\left[ \hat{Q}^{MC}_{N} \right] = \frac{\mathbb{V}\left[ Q \right]}{N}. \qquad (38) \]

Furthermore, it is possible to show, in the limit
$N \to \infty$, that the error
$\hat{Q}^{MC}_{N} - \mathbb{E}\left[ Q \right] \sim \sqrt{\frac{\mathbb{V}\left[ Q \right]}{N}}\, \mathcal{N}(0,1)$, where $\mathcal{N}(0,1)$
represents a standard normal RV. As a
consequence, it is possible to define a confidence interval for the MC
estimator which has an amplitude proportional to the standard deviation
of the estimator. Indeed, the variance of the estimator plays a
fundamental role in the quality of the numerical results: a reduction
of the estimator variance corresponds to an error reduction in the statistics.
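For illustration, the following short Python sketch (not part of Dakota; the function Q is a made-up stand-in for an arbitrary quantity of interest) computes the plain MC estimator of Eq. (37) and the standard error implied by Eq. (38):

    import numpy as np

    def Q(xi):
        # Stand-in quantity of interest: any deterministic function of the
        # random inputs xi would do here.
        return np.sin(xi[:, 0]) + 0.1 * xi[:, 1] ** 2

    rng = np.random.default_rng(0)
    N = 10_000
    xi = rng.standard_normal((N, 2))     # N realizations of the random vector
    samples = Q(xi)

    Q_hat = samples.mean()               # MC estimator of E[Q], Eq. (37)
    var_Q = samples.var(ddof=1)          # sample estimate of V[Q]
    std_err = np.sqrt(var_Q / N)         # estimator standard deviation, Eq. (38)

    print(f"E[Q] ~ {Q_hat:.4f} +/- {1.96 * std_err:.4f} (95% CI)")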
A closer inspection of Eq. (38)
indicates that only an increase in the number of simulations $N$
might reduce the overall variance, since
$\mathbb{V}\left[ Q \right]$ is an intrinsic property of the
model under analysis. However, more sophisticated techniques have been
proposed to accelerate the convergence of a MC simulation. For instance,
an incomplete list of these techniques includes stratified sampling,
importance sampling, Latin hypercube sampling, deterministic Sobol' sequences and
control variates (see [Owe13]). In particular, the control variate approach is based
on the idea of replacing the RV $Q$ with one that has
the same expected value, but with a smaller variance. The goal is to
reduce the numerator in Eq. (38),
and hence the value of the estimator variance, without requiring a larger
number of simulations. In a practical setting, the control variate makes
use of an auxiliary RV $G$ for which
the expected value $\mathbb{E}\left[ G \right]$ is known. Indeed,
the alternative estimator can be defined as

\[ \hat{Q}^{MCCV}_{N} = \hat{Q}^{MC}_{N} + \beta \left( \mathbb{E}\left[ G \right] - \hat{G}^{MC}_{N} \right). \]
The MC control variate estimator is unbiased, but its variance now has a more complex
dependence, not only on $\mathbb{V}\left[ Q \right]$, but
also on $\mathbb{V}\left[ G \right]$ and the covariance between
$Q$ and $G$, since

\[ \mathbb{V}\left[ \hat{Q}^{MCCV}_{N} \right] = \frac{1}{N} \left( \mathbb{V}\left[ Q \right] + \beta^2\, \mathbb{V}\left[ G \right] - 2\,\beta\, \mathrm{Cov}\left( Q, G \right) \right). \]

The parameter $\beta$ can be used to minimize the overall variance,
leading to

\[ \beta^{\star} = \frac{\mathrm{Cov}\left( Q, G \right)}{\mathbb{V}\left[ G \right]}, \]

for which the estimator variance follows as

\[ \mathbb{V}\left[ \hat{Q}^{MCCV}_{N} \right] = \frac{\mathbb{V}\left[ Q \right]}{N} \left( 1 - \rho^2 \right). \]

Therefore, the overall variance of the estimator
$\hat{Q}^{MCCV}_{N}$ is proportional to the variance of the standard
MC estimator $\hat{Q}^{MC}_{N}$ through a factor $1 - \rho^2$,
where
$\rho = \frac{\mathrm{Cov}\left( Q, G \right)}{\sqrt{ \mathbb{V}\left[ Q \right]\, \mathbb{V}\left[ G \right] }}$
is the Pearson correlation coefficient between $Q$ and $G$.
Since $0 \le \rho^2 \le 1$, the variance $\mathbb{V}\left[ \hat{Q}^{MCCV}_{N} \right]$ is never larger than
the corresponding $\mathbb{V}\left[ \hat{Q}^{MC}_{N} \right]$.
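The effect of the factor $1 - \rho^2$ can be reproduced with a few lines of illustrative Python (the choices of $Q$ and $G$ below are arbitrary examples with a known control mean, not Dakota code):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 10_000
    xi = rng.standard_normal(N)

    Q = np.exp(xi)                 # target RV
    G = xi                         # control variate with known mean E[G] = 0
    EG = 0.0

    beta = np.cov(Q, G, ddof=1)[0, 1] / np.var(G, ddof=1)  # estimate of beta*
    Q_cv = Q + beta * (EG - G)                             # control variate samples

    rho = np.corrcoef(Q, G)[0, 1]
    print("plain MC estimate:", Q.mean())
    print("CV estimate:      ", Q_cv.mean())
    print("variance ratio vs. 1 - rho^2:",
          np.var(Q_cv, ddof=1) / np.var(Q, ddof=1), 1 - rho**2)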
The control variate technique can be seen as a very general approach to
accelerate a MC simulation. The main step is to define a convenient
control variate function which is cheap to evaluate and well correlated
to the target function. For instance, one may employ function evaluations obtained
at a different (coarser) resolution, or even evaluations coming
from a cruder physical/engineering approximation of the problem. A
viable way of building a well correlated control variate is to rely on a
low-fidelity model (i.e. a crude approximation of the model of interest)
to estimate the control variate using estimated control means (see
[NgW14, PTSW14] for more details). In this latter case,
the expected value of the low-fidelity model is clearly not known and needs to be computed.
With a slight change in notation, it is possible to write

\[ \hat{Q}^{CVMC} = \hat{Q}^{MC}_{HF} + \alpha \left( \hat{\mu}_{LF} - \hat{Q}^{MC}_{LF} \right), \]

where $\hat{Q}^{MC}_{HF}$ represents the MC estimator for the high-fidelity model, $\hat{Q}^{MC}_{LF}$ the MC estimator for the low-fidelity model
and $\hat{\mu}_{LF}$ a different (more resolved) approximation for $\mathbb{E}\left[ Q_{LF} \right]$. If $N$ samples are used for approximating $\hat{Q}^{MC}_{HF}$ and $\hat{Q}^{MC}_{LF}$,
and a total of $r N$ samples (with $r > 1$) for the low-fidelity model are available, an optimal solution, which guarantees the best use of the low-fidelity resources,
can be obtained following [NgW14] as

\[ r^{\star} = \sqrt{ \frac{w_{HF}}{w_{LF}}\, \frac{\rho^2}{1 - \rho^2} }, \]

where $w_{HF}$ and $w_{LF}$ represent the cost of evaluating the high- and low-fidelity models, respectively, and $\rho$ is the correlation between the two models. This solution leads to the following expression for the estimator variance

\[ \mathbb{V}\left[ \hat{Q}^{CVMC} \right] = \frac{ \mathbb{V}\left[ Q_{HF} \right] }{N} \left( 1 - \frac{r-1}{r}\, \rho^2 \right), \]

which shows similarities with the variance of a control variate estimator, the only difference being the term $\frac{r-1}{r}$ that, by multiplying the correlation
$\rho^2$, effectively penalizes the estimator due to the need for estimating the low-fidelity mean.
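As a rough illustration of this allocation logic, the sketch below assumes the closed-form ratio $r^{\star}$ quoted above and splits a computational budget between the two models; the cost and correlation values are invented for the example:

    import numpy as np

    def cvmc_allocation(budget, w_hf, w_lf, rho):
        """Split a computational budget between HF and LF evaluations for the
        CVMC estimator with an estimated LF mean, using the closed-form ratio
        r* = sqrt((w_hf / w_lf) * rho^2 / (1 - rho^2)) quoted above.  The LF
        model receives a total of r * N_hf evaluations."""
        r = np.sqrt((w_hf / w_lf) * rho**2 / (1.0 - rho**2))
        n_hf = budget / (w_hf + r * w_lf)   # cost balance: N_hf*w_hf + r*N_hf*w_lf
        return int(n_hf), int(r * n_hf)

    n_hf, n_lf = cvmc_allocation(budget=1000.0, w_hf=1.0, w_lf=0.01, rho=0.95)
    r = n_lf / n_hf
    print("N_HF =", n_hf, " N_LF =", n_lf)
    print("variance factor vs. plain MC with N_HF samples:",
          1.0 - (r - 1.0) / r * 0.95**2)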
Another common case encountered in practice is the availability of more than one low-fidelity model. In this case, the multifidelity Monte Carlo (MFMC) estimator can be extended following
[PWG16, PWG18] as

\[ \hat{Q}^{MFMC} = \hat{Q}^{MC}_{HF} + \sum_{i=1}^{M} \alpha_i \left( \hat{\mu}_{LF_i} - \hat{Q}^{MC}_{LF_i} \right), \]

where $Q_{LF_i}$ represents the generic $i$th low-fidelity model.
The MFMC estimator is still unbiased (similarly to MC) and shares similarities with CVMC; indeed, one can recover CVMC directly from it in the case of a single low-fidelity model. For each low-fidelity model we use $N_i$ samples, as in the CVMC case; however, for $i \ge 2$, the term $\hat{Q}^{MC}_{LF_i}$ is approximated with exactly the same samples as the previous model, while each $\hat{\mu}_{LF_i}$ is obtained by adding to this set a number of additional independent samples. Following [PWG16], the weights can be obtained as

\[ \alpha_i = \rho_i\, \sqrt{ \frac{ \mathbb{V}\left[ Q_{HF} \right] }{ \mathbb{V}\left[ Q_{LF_i} \right] } }, \]

where $\rho_i$ is the correlation between the high-fidelity model and the $i$th low-fidelity model.
The optimal resource allocation problem is also obtainable in closed form if, as demonstrated in [PWG16], the following conditions on the models' correlations and costs hold (with models ordered such that $\rho_1^2 \ge \rho_2^2 \ge \cdots$ and $\rho_{M+1} := 0$):

\[ \frac{w_{i-1}}{w_{i}} > \frac{\rho_{i-1}^2 - \rho_{i}^2}{\rho_{i}^2 - \rho_{i+1}^2}. \]
In general engineering applications, the quantity of interest $Q$
is obtained as the result of the numerical solution of a partial
differential equation (possibly a system of them). Therefore, the
dependence on the physical and/or temporal
coordinates should be included, hence
$Q = Q\!\left( \mathbf{x}, t, \boldsymbol{\xi} \right)$. A finite spatial/temporal
resolution is always employed to numerically solve a PDE, implying the
presence of a discretization error in addition to the stochastic error.
The term discretization is applied generically with reference to either
the spatial tessellation, the temporal resolution, or both (commonly,
they are linked). For a generic tessellation with $M$
degrees-of-freedom (DOFs), the PDE solution of $Q$ is referred to
as $Q_M$. Since $Q_M \to Q$ for
$M \to \infty$, then
$\mathbb{E}\left[ Q_M \right] \to \mathbb{E}\left[ Q \right]$ for $M \to \infty$ with a prescribed order of convergence. A
MC estimator in the presence of a finite spatial resolution and finite
sampling is

\[ \hat{Q}^{MC}_{M,N} = \frac{1}{N} \sum_{i=1}^{N} Q_M\!\left( \boldsymbol{\xi}^{(i)} \right), \]

for which the mean square error (MSE) is

\[ \mathbb{E}\left[ \left( \hat{Q}^{MC}_{M,N} - \mathbb{E}\left[ Q \right] \right)^2 \right] = \frac{ \mathbb{V}\left[ Q_M \right] }{N} + \left( \mathbb{E}\left[ Q_M - Q \right] \right)^2, \]

where the first term represents the variance of the estimator, and the
second term
reflects the bias introduced by the (finite) spatial discretization. The
two contributions appear to be independent of each other; accurate MC
estimates can only be obtained by drawing the required number $N$
of simulations of $Q_M\!\left( \boldsymbol{\xi}^{(i)} \right)$ at a sufficiently fine
resolution $M$. Since the numerical cost of a PDE solve is related to
the number of DOFs of the tessellation, the total cost of a MC
simulation for a PDE can easily become intractable for complex
multi-physics applications that are computationally intensive.
The multilevel Monte Carlo (MLMC) algorithm has been introduced,
starting from the control variate idea, for situations in which
additional discretization levels can be defined. The basic idea,
borrowed from the multigrid approach, is to replace the evaluation of
the statistics of $Q_M$ with a sequence of evaluations at coarser
levels. If it is possible to define a sequence of discretization levels
$\left\{ M_\ell : \ell = 0, \ldots, L \right\}$ with
$M_0 < M_1 < \cdots < M_L \stackrel{\mathrm{def}}{=} M$, the
expected value $\mathbb{E}\left[ Q_M \right]$ can be decomposed,
exploiting the linearity of the expected value operator, as

\[ \mathbb{E}\left[ Q_M \right] = \mathbb{E}\left[ Q_{M_0} \right] + \sum_{\ell=1}^{L} \mathbb{E}\left[ Q_{M_\ell} - Q_{M_{\ell-1}} \right]. \]

If the difference function $Y_\ell$ is defined according to

\[ Y_\ell = \begin{cases} Q_{M_0} & \text{if } \ell = 0, \\ Q_{M_\ell} - Q_{M_{\ell-1}} & \text{if } 0 < \ell \le L, \end{cases} \]

the expected value
$\mathbb{E}\left[ Q_M \right] = \sum_{\ell=0}^{L} \mathbb{E}\left[ Y_\ell \right]$.
A multilevel MC estimator is obtained when a MC estimator is adopted
independently for the evaluation of the expected value of $Y_\ell$
on each level. The resulting multilevel estimator
$\hat{Q}^{ML}_{M}$ is

\[ \hat{Q}^{ML}_{M} = \sum_{\ell=0}^{L} \hat{Y}^{MC}_{\ell, N_\ell} = \sum_{\ell=0}^{L} \frac{1}{N_\ell} \sum_{i=1}^{N_\ell} Y_\ell\!\left( \boldsymbol{\xi}^{(i,\ell)} \right). \]

Since the multilevel estimator is unbiased, the advantage of using this
formulation is in its reduced estimator variance
$\sum_{\ell=0}^{L} N_\ell^{-1}\, \mathbb{V}\left[ Y_\ell \right]$:
since $Q_M \to Q$, the difference function $Y_\ell \to 0$
as the level $\ell$ increases.
Indeed, the corresponding number of samples $N_\ell$ required to
resolve the variance associated with the $\ell$th level is
expected to decrease with $\ell$.
The MLMC algorithm can be interpreted as a strategy to optimally
allocate resources. If the total cost of the MLMC algorithm is written
as

\[ \mathcal{C}\left( \hat{Q}^{ML}_{M} \right) = \sum_{\ell=0}^{L} \mathcal{C}_\ell\, N_\ell, \]

with $\mathcal{C}_\ell$ being the cost of the evaluation of $Y_\ell$
(involving either one or two discretization evaluations),
then the following constrained minimization problem can be formulated,

\[ f\left( N_\ell, \lambda \right) = \sum_{\ell=0}^{L} \mathcal{C}_\ell\, N_\ell + \lambda \left( \sum_{\ell=0}^{L} N_\ell^{-1}\, \mathbb{V}\left[ Y_\ell \right] - \frac{\varepsilon^2}{2} \right), \qquad (41) \]

where an equality constraint enforces a stochastic error (from the MLMC
estimator variance) equal to the residual bias error
($\varepsilon^2/2$)
using a Lagrange multiplier $\lambda$. This equality constraint
reflects a balance between the two contributions to the MSE, reflecting the
goal to not over-resolve one or the other. The result of the
minimization is

\[ N_\ell = \frac{2}{\varepsilon^2} \sqrt{ \frac{ \mathbb{V}\left[ Y_\ell \right] }{ \mathcal{C}_\ell } } \left[ \sum_{k=0}^{L} \sqrt{ \mathbb{V}\left[ Y_k \right]\, \mathcal{C}_k } \right], \]

defining an optimal sample allocation per discretization level.
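The allocation formula above is simple to apply in practice. The following illustrative sketch (a toy model hierarchy, not Dakota code) estimates $\mathbb{V}\left[ Y_\ell \right]$ from small pilot samples and then sizes $N_\ell$ for a prescribed target $\varepsilon$:

    import numpy as np

    rng = np.random.default_rng(2)

    def Y(level, n):
        """Toy stand-in for samples of the difference function Y_l: its
        variance decays geometrically with the level, mimicking Q_l - Q_{l-1}."""
        base = np.exp(rng.standard_normal(n))
        return base if level == 0 else base * 2.0 ** (-1.5 * level)

    L = 3
    costs = [4.0 ** l for l in range(L + 1)]   # cost per sample of Y_l
    eps = 0.01                                 # target RMS stochastic error

    # Pilot runs to estimate V[Y_l] on each level
    variances = [np.var(Y(l, 100), ddof=1) for l in range(L + 1)]

    # Optimal allocation: N_l = 2/eps^2 * sqrt(V_l / C_l) * sum_k sqrt(V_k C_k)
    total = sum(np.sqrt(v * c) for v, c in zip(variances, costs))
    N = [int(np.ceil(2.0 / eps**2 * np.sqrt(v / c) * total))
         for v, c in zip(variances, costs)]

    # Assemble the MLMC estimate of E[Q_M]
    Q_ml = sum(Y(l, n).mean() for l, n in enumerate(N))
    print("samples per level:", N, " MLMC estimate:", Q_ml)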
Despite the original introduction of the MLMC approach for the
computation of the mean estimator in
[Gil08, Gil15], it is possible to estimate
higher-order moments with a MLMC sampling strategy, as for instance the
variance.
A single-level unbiased estimator for the variance of a generic QoI at
the highest level $M_L$ of the hierarchy can be written as

\[ \hat{\mathbb{V}}\left[ Q_{M_L} \right] \approx \frac{1}{N_L - 1} \sum_{i=1}^{N_L} \left( Q_{M_L}^{(i)} - \hat{Q}_{M_L} \right)^2. \qquad (42) \]

The multilevel version of
Eq. (42)
can be obtained via a telescopic expansion in terms of differences of
estimators over subsequent levels. To simplify the notation, from now on we only indicate the level, i.e. $Q_{M_\ell} \equiv Q_\ell$.
The expansion is obtained by re-writing
Eq. (42)
as

\[ \hat{\mathbb{V}}\left[ Q_L \right] \approx \frac{1}{N_0 - 1} \sum_{i=1}^{N_0} \left( Q_0^{(i)} - \hat{Q}_0 \right)^2 + \sum_{\ell=1}^{L} \frac{1}{N_\ell - 1} \sum_{i=1}^{N_\ell} \left[ \left( Q_\ell^{(i)} - \hat{Q}_\ell \right)^2 - \left( Q_{\ell-1}^{(i)} - \hat{Q}_{\ell-1} \right)^2 \right]. \]
It is important here to note that, since the estimators at the levels
$\ell$ and $\ell-1$ are computed with the same number of
samples $N_\ell$, both estimators use the factor $1/(N_\ell - 1)$ to obtain
their unbiased version. Moreover, each estimator is indeed written with
respect to its own mean value, i.e. the mean value on its level,
either $\hat{Q}_\ell$ or $\hat{Q}_{\ell-1}$. This last requirement leads to
the computation of a local expected value estimator with respect to the
same samples employed for the difference estimator. If we now denote
with $\hat{Q}^{(2)}_{\ell, N_\ell}$ the sampling estimator for the second
order moment of the QoI $Q_\ell$, we can write

\[ \hat{\mathbb{V}}^{ML}\left[ Q \right] \approx \hat{Q}^{(2)}_{0, N_0} + \sum_{\ell=1}^{L} \left( \hat{Q}^{(2)}_{\ell, N_\ell} - \hat{Q}^{(2)}_{\ell-1, N_\ell} \right). \qquad (44) \]

Note that $\hat{Q}^{(2)}_{\ell, N_\ell}$ and $\hat{Q}^{(2)}_{\ell-1, N_\ell}$ are
explicitly sharing the same $N_\ell$ samples.
For this estimator we are interested in minimizing its cost while also
prescribing its variance, as done for the expected value. This is
accomplished by evaluating the variance of the multilevel variance
estimator,

\[ \mathbb{V}\left[ \hat{\mathbb{V}}^{ML}\left[ Q \right] \right] = \sum_{\ell=0}^{L} \left( \mathbb{V}\left[ \hat{Q}^{(2)}_{\ell, N_\ell} \right] + \mathbb{V}\left[ \hat{Q}^{(2)}_{\ell-1, N_\ell} \right] - 2\, \mathrm{Cov}\left( \hat{Q}^{(2)}_{\ell, N_\ell}, \hat{Q}^{(2)}_{\ell-1, N_\ell} \right) \right), \]

where the covariance term is a result of the dependence described
in (44) (for $\ell = 0$ only the first contribution is present).
The previous expression can be evaluated once the variance for the
sample estimator of the second order moment $\hat{Q}^{(2)}_{\ell, N_\ell}$
and the covariance
term $\mathrm{Cov}\left( \hat{Q}^{(2)}_{\ell, N_\ell}, \hat{Q}^{(2)}_{\ell-1, N_\ell} \right)$
are known. These terms can be evaluated as

\[ \mathbb{V}\left[ \hat{Q}^{(2)}_{\ell, N_\ell} \right] = \frac{1}{N_\ell} \left( \hat{Q}^{(4)}_{\ell, N_\ell} - \frac{N_\ell - 3}{N_\ell - 1} \left( \hat{Q}^{(2)}_{\ell, N_\ell} \right)^2 \right), \]

where $\hat{Q}^{(4)}_{\ell, N_\ell}$ denotes the sampling estimator for the
fourth order central moment.
The expression for the covariance term is more involved, because the two
second order moment estimators share the same samples; it can be written
in terms of expectations of products of the QoI evaluated at
levels $\ell$ and $\ell-1$, whose leading term is evaluated by estimating and
combining several sampling moments.
It is important to note here that these quantities can be
computed only if several sampling estimators for products of the QoIs at
levels $\ell$ and $\ell-1$ are available. These quantities
are not required in the standard MLMC implementation for the mean, and
therefore, for the estimation of the variance, more data need to be stored
to assemble the quantities on each level.
An optimization problem, similar to the one formulated for the mean in
the previous section, can be written in the case of the variance,

\[ \min_{N_\ell} \; \sum_{\ell=0}^{L} \mathcal{C}_\ell\, N_\ell \quad \text{s.t.} \quad \mathbb{V}\left[ \hat{\mathbb{V}}^{ML}\left[ Q \right] \right] = \frac{\varepsilon^2}{2}. \qquad (45) \]

This optimization problem can be solved in two different ways, namely via an
analytical approximation and by solving the non-linear optimization
problem directly. The analytical approximation follows the approach described in
[PKN17] and introduces a helper variable that approximates the per-level
contribution to the estimator variance as inversely proportional to
$N_\ell$; with this helper variable, a constrained minimization problem of
the same form as Eq. (41) is formulated and solved in closed form.
The second approach uses numerical optimization directly on the
non-linear optimization
problem (45) to
find an optimal sample allocation. Dakota uses OPTPP as the default
optimizer and switches to NPSOL if it is available.
Both approaches for finding the optimal sample allocation when
allocating for the variance are currently implemented in Dakota. The
analytical solution is employed by default while the optimization is
enabled using a keyword. We refer to the reference manual for a
discussion of the keywords to select these different options.
The extension of MLMC to the standard deviation is slightly more
complicated by the presence of the square root, which prevents a
straightforward expansion over levels.
One possible way of obtaining a biased estimator for the standard
deviation is

\[ \hat{\sigma}^{ML}\left[ Q \right] = \sqrt{ \hat{\mathbb{V}}^{ML}\left[ Q \right] }. \]

To estimate the variance of the standard deviation estimator, it is
possible to leverage the result, derived in the previous section for the
variance, and write the variance of the standard deviation as a function
of the variance and its estimator variance. If we can estimate the
variance $\hat{\mathbb{V}}^{ML}\left[ Q \right]$ and its estimator variance
$\mathbb{V}\left[ \hat{\mathbb{V}}^{ML}\left[ Q \right] \right]$, the variance for the
standard deviation can be approximated as

\[ \mathbb{V}\left[ \hat{\sigma}^{ML}\left[ Q \right] \right] \approx \frac{ \mathbb{V}\left[ \hat{\mathbb{V}}^{ML}\left[ Q \right] \right] }{ 4\, \hat{\mathbb{V}}^{ML}\left[ Q \right] }. \]

Similarly to the variance case, a numerical optimization problem can be
solved to obtain the sample allocation for the estimator of the standard
deviation given a prescribed accuracy target.
Often, especially in the context of optimization, it is necessary to
estimate statistics of a metric defined as a linear combination of the
mean and standard deviation of a QoI. A classical reliability measure
can be defined, for the quantity $Q$, starting
from multilevel (ML) statistics, as

\[ \hat{c}^{ML} = \hat{Q}^{ML} + \alpha\, \hat{\sigma}^{ML}\left[ Q \right]. \]

To obtain the sample allocation, in the MLMC context, it is necessary
to evaluate the variance of $\hat{c}^{ML}$, which can be written as

\[ \mathbb{V}\left[ \hat{c}^{ML} \right] = \mathbb{V}\left[ \hat{Q}^{ML} \right] + \alpha^2\, \mathbb{V}\left[ \hat{\sigma}^{ML} \right] + 2\,\alpha\, \mathrm{Cov}\left( \hat{Q}^{ML}, \hat{\sigma}^{ML} \right). \]

This expression requires, in addition to the already available terms
$\mathbb{V}\left[ \hat{Q}^{ML} \right]$ and
$\mathbb{V}\left[ \hat{\sigma}^{ML} \right]$, also the
covariance term $\mathrm{Cov}\left( \hat{Q}^{ML}, \hat{\sigma}^{ML} \right)$. This latter term can be written knowing
that shared samples are only present on the same level,
which leads to the need for evaluating the following four
contributions per level: $\mathrm{Cov}\left( \hat{Q}_\ell, \hat{\sigma}_\ell \right)$,
$\mathrm{Cov}\left( \hat{Q}_\ell, \hat{\sigma}_{\ell-1} \right)$,
$\mathrm{Cov}\left( \hat{Q}_{\ell-1}, \hat{\sigma}_\ell \right)$ and
$\mathrm{Cov}\left( \hat{Q}_{\ell-1}, \hat{\sigma}_{\ell-1} \right)$.
In Dakota, we adopt the following approximation, for two arbitrary
levels $\ell$ and $\kappa$
(we indicate with $\hat{Q}^{(2)}_{\kappa}$ the second central moment
for $Q$ at the level $\kappa$), which corresponds to
assuming that the correlation between expected value and variance is a
good approximation of the correlation between the expected value and
the standard deviation, i.e.
$\rho\left( \hat{Q}_\ell, \hat{\sigma}_\kappa \right) \approx \rho\left( \hat{Q}_\ell, \hat{Q}^{(2)}_\kappa \right)$.
This assumption is particularly convenient
because it is possible to obtain in closed form the covariance between
expected value and variance and, therefore, we can adopt the following
approximation

\[ \mathrm{Cov}\left( \hat{Q}_\ell, \hat{\sigma}_\kappa \right) \approx \rho\left( \hat{Q}_\ell, \hat{Q}^{(2)}_\kappa \right) \sqrt{ \mathbb{V}\left[ \hat{Q}_\ell \right]\, \mathbb{V}\left[ \hat{\sigma}_\kappa \right] } = \mathrm{Cov}\left( \hat{Q}_\ell, \hat{Q}^{(2)}_\kappa \right) \sqrt{ \frac{ \mathbb{V}\left[ \hat{\sigma}_\kappa \right] }{ \mathbb{V}\left[ \hat{Q}^{(2)}_\kappa \right] } }. \]

Finally, the term $\mathrm{Cov}\left( \hat{Q}_\ell, \hat{Q}^{(2)}_\kappa \right)$ can be
derived in closed form for all possible combinations of levels and shared samples.
Even for this case, the sample allocation problem can be solved by
resorting to a numerical optimization given a prescribed target.
The MLMC approach described in Multilevel Monte Carlo can
be related to a recursive control variate technique in that it
seeks to reduce the variance of the target function in order to limit
the sampling at high resolution. In addition, the difference function
$Y_\ell$ for each level can itself be the target of an additional
control variate (refer to Multifidelity Monte Carlo). A
practical scenario is when not only different resolution levels are
available (multilevel part), but also a cheaper computational model can
be used (multifidelity part). The combined approach is a
multilevel-multifidelity algorithm [FDKI17, GIE15, NT15], and in particular, a
multilevel-control variate Monte Carlo sampling approach.
If the target QoI can be generated from both a high-fidelity (HF) model
and a cheaper, possibly biased low-fidelity (LF) model, it is possible
to write the following estimator

\[ \hat{Q}^{MLMF}_{M} = \sum_{\ell=0}^{L} \left[ \hat{Y}^{HF,MC}_{\ell, N_\ell^{HF}} + \alpha_\ell \left( \hat{\mathbb{E}}\!\left[ Y^{LF}_\ell \right] - \hat{Y}^{LF,MC}_{\ell, N_\ell^{HF}} \right) \right]. \qquad (48) \]

The estimator is unbiased with
respect to $\hat{Q}^{ML}_{M}$, hence, up to the discretization bias, with respect to
the true value $\mathbb{E}\left[ Q \right]$.
The control variate is obtained by means of the LF model realizations,
for which the expected value on each level can be computed in two different ways:
$\hat{Y}^{LF,MC}_{\ell, N_\ell^{HF}}$ and
$\hat{\mathbb{E}}\!\left[ Y^{LF}_\ell \right]$. A MC estimator is
employed for each term, but the estimation of
$\hat{\mathbb{E}}\!\left[ Y^{LF}_\ell \right]$ is more resolved
than $\hat{Y}^{LF,MC}_{\ell, N_\ell^{HF}}$. For
$\hat{Y}^{LF,MC}_{\ell, N_\ell^{HF}}$, we choose the number of LF
realizations to be equal to the number of HF realizations,
$N_\ell^{HF}$. For the more resolved
$\hat{\mathbb{E}}\!\left[ Y^{LF}_\ell \right]$, we augment these samples with
an additional and independent set of realizations
$\Delta_\ell^{LF}$, hence
$N_\ell^{LF} = N_\ell^{HF} + \Delta_\ell^{LF}$.
The set $\Delta_\ell^{LF}$ is written, for convenience,
as proportional to $N_\ell^{HF}$ by means of a
parameter $r_\ell$,

\[ \Delta_\ell^{LF} = r_\ell\, N_\ell^{HF} \quad \Rightarrow \quad N_\ell^{LF} = \left( 1 + r_\ell \right) N_\ell^{HF}. \]
The set of samples $\Delta_\ell^{LF}$ is independent of
$N_\ell^{HF}$, therefore the variance of the estimator
can be written as (for further details see
[GIE15])

\[ \mathbb{V}\left[ \hat{Q}^{MLMF}_{M} \right] = \sum_{\ell=0}^{L} \frac{1}{N_\ell^{HF}} \left( \mathbb{V}\left[ Y_\ell^{HF} \right] + \alpha_\ell^2\, \frac{r_\ell}{1+r_\ell}\, \mathbb{V}\left[ Y_\ell^{LF} \right] - 2\,\alpha_\ell\, \frac{r_\ell}{1+r_\ell}\, \rho_\ell \sqrt{ \mathbb{V}\left[ Y_\ell^{HF} \right]\, \mathbb{V}\left[ Y_\ell^{LF} \right] } \right). \]

The Pearson correlation coefficient between the HF and LF models is
indicated by $\rho_\ell$ in the previous equation. Assuming the
vector $r_\ell$ as a parameter, the variance is minimized per
level, mimicking the standard control variate approach, and thus
obtaining the optimal coefficient as
$\alpha_\ell^{\star} = \rho_\ell\, \sqrt{ \frac{ \mathbb{V}\left[ Y_\ell^{HF} \right] }{ \mathbb{V}\left[ Y_\ell^{LF} \right] } }$.
By making use of the optimal coefficient $\alpha_\ell^{\star}$, it is
possible to show that the variance
$\mathbb{V}\left[ \hat{Q}^{MLMF}_{M} \right]$ is
proportional to the variance
$\sum_{\ell=0}^{L} \mathbb{V}\left[ Y_\ell^{HF} \right] / N_\ell^{HF}$ of the MLMC estimator
through a factor
$\Lambda_\ell(r_\ell)$, which is an explicit function of the
ratio $r_\ell$:

\[ \mathbb{V}\left[ \hat{Q}^{MLMF}_{M} \right] = \sum_{\ell=0}^{L} \frac{ \mathbb{V}\left[ Y_\ell^{HF} \right] }{ N_\ell^{HF} }\, \Lambda_\ell(r_\ell), \qquad \Lambda_\ell(r_\ell) = 1 - \frac{r_\ell}{1+r_\ell}\, \rho_\ell^2. \]
Note that the factor $\frac{r_\ell}{1+r_\ell}$ multiplying $\rho_\ell^2$ represents a penalty with
respect to the classical control variate approach presented in Multifidelity Monte Carlo, which stems from the need to
evaluate the unknown function
$\mathbb{E}\!\left[ Y^{LF}_\ell \right]$. However, the
ratio $\frac{r_\ell}{1+r_\ell}$ is dependent on the additional number of
LF evaluations $\Delta_\ell^{LF}$, hence it is fair to
assume that it can be made very close to unity by choosing an affordably
large $\Delta_\ell^{LF}$, i.e.,
$\frac{r_\ell}{1+r_\ell} \to 1$ as $r_\ell \to \infty$.
The optimal sample allocation is determined taking into account the
relative cost between the HF and LF models and their correlation (per
level). In particular, the optimization problem introduced in
Eq. (41) is replaced by

\[ f\left( N_\ell^{HF}, r_\ell, \lambda \right) = \sum_{\ell=0}^{L} \mathcal{C}^{eq}_\ell\, N_\ell^{HF} + \lambda \left( \sum_{\ell=0}^{L} \frac{ \mathbb{V}\left[ Y_\ell^{HF} \right] }{ N_\ell^{HF} }\, \Lambda_\ell(r_\ell) - \frac{\varepsilon^2}{2} \right), \]

where the optimal allocation is obtained as well as the optimal ratio
$r_\ell$. The cost per level now includes the sum of the HF and LF
realization costs, therefore it can be expressed as
$\mathcal{C}^{eq}_\ell = \mathcal{C}^{HF}_\ell + \mathcal{C}^{LF}_\ell \left( 1 + r_\ell \right)$.
If the cost ratio between the HF and LF model is
$w_\ell = \mathcal{C}^{HF}_\ell / \mathcal{C}^{LF}_\ell$, then the optimal ratio is

\[ r_\ell^{\star} = -1 + \sqrt{ \frac{ \rho_\ell^2 }{ 1 - \rho_\ell^2 }\, w_\ell }, \]

and the optimal allocation is

\[ N_\ell^{HF} = \frac{2}{\varepsilon^2} \sqrt{ \frac{ \mathbb{V}\left[ Y_\ell^{HF} \right]\, \Lambda_\ell(r_\ell^{\star}) }{ \mathcal{C}^{eq}_\ell(r_\ell^{\star}) } } \left[ \sum_{k=0}^{L} \sqrt{ \mathbb{V}\left[ Y_k^{HF} \right]\, \Lambda_k(r_k^{\star})\, \mathcal{C}^{eq}_k(r_k^{\star}) } \right]. \]
It is clear that the efficiency of the algorithm is related not only to
the efficiency of the LF model, i.e. how fast a simulation runs with
respect to the HF model, but also to the correlation between the LF and
HF model.
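The following sketch illustrates this trade-off per level, assuming the variance-reduction factor $\Lambda_\ell(r_\ell) = 1 - \frac{r_\ell}{1+r_\ell}\rho_\ell^2$ and the closed-form ratio $r_\ell^{\star}$ given above; the correlation and cost-ratio values are invented for the example:

    import numpy as np

    def lf_ratio_and_factor(rho, w):
        """Per-level optimum for the multilevel-multifidelity estimator:
        r* = -1 + sqrt(rho^2 / (1 - rho^2) * w) additional LF samples per HF
        sample, and the resulting variance factor Lambda(r) = 1 - r/(1+r)*rho^2."""
        r = -1.0 + np.sqrt(rho**2 / (1.0 - rho**2) * w)
        lam = 1.0 - r / (1.0 + r) * rho**2
        return r, lam

    for rho, w in [(0.99, 100.0), (0.90, 100.0), (0.90, 10.0)]:
        r, lam = lf_ratio_and_factor(rho, w)
        print(f"rho = {rho:.2f}, w = {w:5.1f}  ->  r* = {r:6.2f},  Lambda = {lam:.3f}")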
A potential refinement of the previous approach [GEI17] consists in exploiting
the QoI on each pair of levels, $\ell$ and $\ell-1$, to
build a more correlated LF function. For instance, it is possible to use
a weighted combination of the LF QoIs on the two levels, denoted $\mathring{Y}^{LF}_\ell$,
and maximize the correlation between $Y^{HF}_\ell$ and $\mathring{Y}^{LF}_\ell$
through a coefficient $\gamma_\ell$.
Formally, the two formulations are completely equivalent if
$Y^{LF}_\ell$ is replaced with
$\mathring{Y}^{LF}_\ell$ in
Eq. (48), and they can be
linked through the ratios of the variances and correlations of the two LF
difference functions, which yield a corresponding expression for the variance of
the estimator.
Therefore, a way to increase the variance reduction is to maximize the
correlation ratio with respect to the
parameter $\gamma_\ell$. It is possible to solve this
maximization problem analytically, obtaining a closed-form expression for
$\gamma_\ell^{\star}$; the resulting optimal allocation of samples across levels and model
forms then follows the same structure shown above.
Quasi-Monte Carlo methods are equal-weight quadrature rules to approximate
the integral

\[ I(f) = \int_{[0,1]^d} f(\boldsymbol{t})\, \mathrm{d}\boldsymbol{t} \]

with deterministically well-chosen sample
points to beat the notoriously slow convergence of a method that uses random MC
samples. They are of the form

\[ Q_N(f) = \frac{1}{N} \sum_{i=0}^{N-1} f\!\left( \boldsymbol{t}^{(i)} \right), \]

which is seemingly identical to the form of the classic MC method from Eq.
(37); however, the $d$-dimensional points $\boldsymbol{t}^{(i)}$
are now carefully chosen inside the domain
$[0,1]^d$. With carefully chosen we mean that the
points have a low discrepancy $D^{*}\!\left( \boldsymbol{t}^{(0)}, \ldots, \boldsymbol{t}^{(N-1)} \right)$. This discrepancy is
important, because it directly appears in the error bound of a QMC method, i.e.,
we have the Koksma-Hlawka inequality [Nie92]

\[ \left| I(f) - Q_N(f) \right| \le D^{*}\!\left( \boldsymbol{t}^{(0)}, \ldots, \boldsymbol{t}^{(N-1)} \right) V(f). \qquad (52) \]

The QMC error thus consists of two parts: a factor that only depends on the
point set (in particular, on the discrepancy of the point set) and a factor
depending only on the function we are trying to integrate (the
so-called variation $V(f)$ of the function $f$).
Some famous examples of low-discrepancy point sets are Sobol points
[Sobol67], Halton points [HS64] and Hammersley points
[Ham13]. The advantage of using such a low-discrepancy point set
is faster convergence: classic theory states that a QMC method may converge like
$\mathcal{O}\!\left( N^{-1} (\log N)^d \right)$, for sufficiently smooth functions $f$, see
[DP10]. Compare this to the classic MC method, that converges like
$\mathcal{O}\!\left( N^{-1/2} \right)$, and it is easy to see why QMC methods are so appealing.
Unfortunately, the classic QMC theory is not adequate in high dimensions (large
$d$): for $N^{-1} (\log N)^d$ to be smaller than $N^{-1/2}$, we
require $N$ to grow exponentially with $d$, an unrealistically large number in
high dimensions. Furthermore, in many problems, the variation $V(f)$ is
infinite, making the error bound in (52) practically
useless.
Then, in 1995, a 360-dimensional integral originating from financial mathematics
was computed very efficiently by Paskov and Traub, see [PT96]. This
led to many new theoretical developments, including the notions of weighted
function spaces and low effective dimension: although the problem is
high-dimensional, not all dimensions are equally important. In the work by Sloan
and Woźniakowski [SWozniakowski98], this decreasing importance is quantified in
terms of weights $\gamma_j$ associated to each dimension $j = 1, \ldots, d$, and
where one assumes $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_d > 0$. Contemporary QMC analysis is then performed by analyzing the problem in a
function space that incorporates these weights. A reinterpretation of
(52) in the weighted space setting with weights $\boldsymbol{\gamma} = (\gamma_j)_{j=1}^{d}$
is then

\[ \left| I(f) - Q_N(f) \right| \le e^{\mathrm{wc}}\!\left( \boldsymbol{t}^{(0)}, \ldots, \boldsymbol{t}^{(N-1)} \right)\, \| f \|_{\boldsymbol{\gamma}}, \]

where $e^{\mathrm{wc}}$ is the so-called worst-case error, and
$\| f \|_{\boldsymbol{\gamma}}$ is the norm of the function in the weighted
function space. The question then becomes one of (strong) tractability: under
which conditions on the weights is the worst-case error bounded independently of
the dimension $d$? The philosophy of modern QMC is therefore to choose the
weights according to the problem at hand, and then construct a QMC method that
yields good performance for all functions that belong to this weighted function
space, see [DP10].
QMC methods come in two major flavors: lattice rules and digital nets. We
will now briefly discuss these two construction methods.

A rank-1 lattice rule generates points of the form

\[ \boldsymbol{t}^{(i)} = \left\{ \frac{i\, \boldsymbol{z}}{N} \right\}, \qquad i = 0, 1, \ldots, N-1, \qquad (53) \]

where $\{ \cdot \}$ denotes the fractional part, i.e., $\{ x \} = x - \lfloor x \rfloor$, and where
$\boldsymbol{z}$ is a $d$-dimensional vector with integers, called the generating
vector. Rank-1 lattices were introduced by Korobov [Kor59] and
Hlawka [Hla62], as the method of good lattice points.
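As a minimal illustration of Eq. (53), the following Python sketch generates the points of a rank-1 lattice rule; the generating vector below reproduces the two-dimensional 21-point Fibonacci lattice and is chosen for illustration only:

    import numpy as np

    def rank1_lattice(N, z):
        """Return the N points {i*z/N}, i = 0..N-1, of a rank-1 lattice rule."""
        i = np.arange(N).reshape(-1, 1)
        return (i * np.asarray(z) / N) % 1.0   # fractional part

    # 21-point Fibonacci lattice in two dimensions (z = (1, 13))
    pts = rank1_lattice(21, [1, 13])
    print(pts[:5])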
The performance of the lattice rule depends critically on the choice of the
generating vector $\boldsymbol{z}$. We assume that
$z_j \in \mathbb{U}_N := \left\{ z \in \mathbb{Z} : 1 \le z \le N-1 \text{ and } \gcd(z, N) = 1 \right\}$, to ensure that every
one-dimensional projection of the $N$ points on one of the coordinate axes
has $N$ distinct values. It can be shown that the number of elements
inside the set $\mathbb{U}_N$ is given by the Euler totient function
$\varphi(N)$. For number-theoretical reasons, $N$ is usually
restricted to be a prime number, such that the number of elements is
$\varphi(N) = N - 1$. In that case, there are an astounding $(N-1)^d$ possible choices for the generating vector $\boldsymbol{z}$. Since
it is impossible to perform an exhaustive search over all possible choices for
large $N$ and $d$ to find the best possible generating vector
$\boldsymbol{z}$, we resort to construction schemes that deliver good
choices for $\boldsymbol{z}$. An example of such a scheme is the
component-by-component (CBC) construction [Kor63, SJ94]. The
algorithm works as follows:
Set $z_1 = 1$.
With $z_1$ fixed, pick $z_2 \in \mathbb{U}_N$ such that
the worst-case error $e^{\mathrm{wc}}(z_1, z_2)$ is minimized.
With $z_1$ and $z_2$ fixed, pick $z_3 \in \mathbb{U}_N$
such that $e^{\mathrm{wc}}(z_1, z_2, z_3)$ is minimized.
…
Hence, this algorithm constructs the components of the generating vector for the
lattice rule one at a time: the $j$th component is obtained by
successive one-dimensional searches, with the previous $j-1$ components kept
fixed. It can be proven that the CBC algorithm constructs generating vectors
that, when used in a lattice rule, achieve the desired convergence close to
$\mathcal{O}\!\left( N^{-1} \right)$, in some weighted function space, see [Kuo03].
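To make the CBC idea concrete, the following deliberately naive sketch (cost $\mathcal{O}(d N^2)$, no fast construction) assumes a weighted Korobov-space squared worst-case error with product weights and smoothness $\alpha = 2$, for which the one-dimensional kernel is $2\pi^2 (x^2 - x + 1/6)$; this particular error criterion is an illustrative assumption, not necessarily the one used in Dakota:

    import numpy as np

    def omega(x):
        # One-dimensional kernel for smoothness alpha = 2:
        # sum_{h != 0} exp(2*pi*1j*h*x) / h^2 = 2*pi^2 * (x^2 - x + 1/6)
        return 2.0 * np.pi**2 * (x**2 - x + 1.0 / 6.0)

    def cbc(N, d, gamma):
        """Naive component-by-component search for a generating vector z,
        minimizing the squared worst-case error
          e^2(z) = -1 + (1/N) * sum_k prod_j (1 + gamma[j]*omega({k*z_j/N}))
        in a weighted Korobov space with product weights gamma (N prime)."""
        k = np.arange(N)
        running = np.ones(N)      # product over dimensions fixed so far, per k
        z = []
        for j in range(d):
            best_zj, best_err, best_col = None, np.inf, None
            for zj in range(1, N):            # N prime: every zj is admissible
                col = 1.0 + gamma[j] * omega((k * zj % N) / N)
                err = -1.0 + np.mean(running * col)
                if err < best_err:
                    best_zj, best_err, best_col = zj, err, col
            z.append(best_zj)
            running = running * best_col
        return z

    print(cbc(N=127, d=4, gamma=[0.9 ** j for j in range(1, 5)]))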
For some particular choices of the weights $\gamma_j$ (called
product weights), the cost of the CBC algorithm is $\mathcal{O}(d\, N \log N)$
operations, i.e., linear in the dimension $d$ and almost linear in the
number of points $N$, due to a fast CBC construction by Nuyens and Cools,
see [CKN06, NC06]. The idea behind the fast construction is that the
CBC construction requires the evaluation of a matrix-vector multiplication with
a circulant matrix, so that the FFT can be used to reduce the cost of the matrix-vector product from
$\mathcal{O}(N^2)$ to $\mathcal{O}(N \log N)$.
Fig. 64 Applying a random shift $\boldsymbol{\Delta}$ to a 21-point Fibonacci lattice in two
dimensions. Take the original lattice (left), apply a random shift
(middle) and wrap the points back onto the unit square (right).
The lattice points given in (53) can be randomized by
adding a random shift vector. If $\boldsymbol{\Delta}$ is a $d$-dimensional
vector of uniform random numbers in $[0, 1)$, we construct the shifted lattice
points as

\[ \boldsymbol{t}^{(i)} = \left\{ \frac{i\, \boldsymbol{z}}{N} + \boldsymbol{\Delta} \right\}. \]

This procedure is illustrated in Fig. 64. Note that the
first untransformed point in the sequence will be
$\boldsymbol{t}^{(0)} = (0, 0, \ldots, 0)$.
For the lattice points to be practically useful, we would like to transform the
lattice rule into a lattice sequence, that allows us to generate
well-distributed points for an arbitrary number of points $N$. To this
end, Eq. (53) is adapted to

\[ \boldsymbol{t}^{(i)} = \left\{ \phi_b(i)\, \boldsymbol{z} \right\}, \]

where $\phi_b$ denotes the so-called radical inverse function in base $b$
(usually, $b = 2$). This function transforms a number
$i = (\ldots i_3 i_2 i_1)_b$ in its base-$b$ representation to
$\phi_b(i) = (0.i_1 i_2 i_3 \ldots)_b$. Note that the radical inverse function agrees with the
original formulation when $N = b^m$ for any $m \ge 0$.
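A short illustrative sketch of the radical inverse in base 2 and its use to make the lattice rule extensible in the number of points:

    def radical_inverse_base2(i):
        """Van der Corput radical inverse in base 2: mirror the binary digits
        of i around the binary point, mapping i to a value in [0, 1)."""
        result, denom = 0.0, 1.0
        while i > 0:
            denom *= 2.0
            result += (i & 1) / denom
            i >>= 1
        return result

    def lattice_sequence_point(i, z):
        """i-th point of an extensible rank-1 lattice sequence in base 2."""
        phi = radical_inverse_base2(i)
        return [(phi * zj) % 1.0 for zj in z]

    print([lattice_sequence_point(i, [1, 13]) for i in range(4)])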
Digital nets and sequences were introduced by Niederreiter, building upon
earlier work by Sobol and Faure [Nie87]. In the digital
construction scheme, a sequence in $d$ dimensions generates points
$\boldsymbol{x}^{(i)} = \left( x^{(i)}_1, \ldots, x^{(i)}_d \right)$, where the
$j$th component is constructed as follows:

Write $i$ in its base-$b$ representation, i.e.,

\[ i = (\ldots i_3 i_2 i_1)_b = i_1 + i_2 b + i_3 b^2 + \cdots \]

Compute the matrix-vector product

\[ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \end{pmatrix} = C_j \begin{pmatrix} i_1 \\ i_2 \\ i_3 \\ \vdots \end{pmatrix}, \]

where all additions and multiplications are performed in base $b$.

Set the $j$th component of the $i$th point to

\[ x^{(i)}_j = (0.y_1 y_2 y_3 \ldots)_b = \frac{y_1}{b} + \frac{y_2}{b^2} + \frac{y_3}{b^3} + \cdots \]

The matrices $C_j$, $j = 1, \ldots, d$, are known as generating
matrices, see [DP10].
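In base 2, the digit-wise arithmetic above reduces to XOR operations on integers. The following illustrative sketch (with toy identity generating matrices, not actual Sobol direction numbers) builds one coordinate of the $i$th point:

    def digital_net_component(i, columns, m):
        """j-th coordinate of the i-th point of a base-2 digital net.

        columns[k] is the k-th column of the generating matrix C_j packed into
        an integer: bit r of that integer is the matrix entry in row r.  Over
        base 2, the matrix-vector product reduces to XOR-ing the columns
        selected by the binary digits of i."""
        y, k = 0, 0
        while i > 0:
            if i & 1:              # digit i_{k+1} of i is set
                y ^= columns[k]
            i >>= 1
            k += 1
        # bit r of y is the output digit y_{r+1}, contributing y_{r+1}/2^(r+1)
        return sum(((y >> r) & 1) / 2.0 ** (r + 1) for r in range(m))

    # Toy example: identity generating matrix (columns are unit vectors), which
    # reproduces the van der Corput sequence in base 2.
    m = 4
    identity_columns = [1 << r for r in range(m)]
    print([digital_net_component(i, identity_columns, m) for i in range(8)])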
We can encode the generating matrices as an integer matrix as follows. The
number of rows in the matrix determines the maximum dimension of the digital
net. The number of columns in the matrix determines the $\log_2$ of the maximum
number of points. An integer on the $j$th row and $k$th column
encodes the $k$th column of the $j$th generating matrix
$C_j$. Since the $k$th column of $C_j$ is a collection of
0's and 1's, it can be represented as an integer with $m$ bits, where
$m$ is the number of rows in the $j$th generating matrix
$C_j$. By default, the encoding assumes the integers are stored with
least significant bit first (LSB), so that the first integer on the $j$th row is 1. This LSB representation has two advantages.
The integers can be reused if the number of bits in the
representation changes.
It generally leads to smaller, human-readable numbers in the first few
entries.
The Sobol sequence is a particularly famous example of a digital net
[Sobol67]. A computer implementation of a Sobol sequence generator in
Fortran 77 was given by Bratley and Fox [BF88] as Algorithm 659.
This implementation allowed points of up to 40 dimensions. It was extended by
Joe and Kuo to allow up to 1111 dimensions in [JK03] and up to 21201
dimensions in [JK08]. In the Dakota implementation of the algorithm
outlined above, we use the iterative construction from Antonov and Saleev
[AS79], that generates the points in Gray code ordering. Knowing
the current point with (Gray code) index $i$, the next point with index
$i+1$ is obtained by XOR'ing the current point with the $k$th
column of the $j$th generating matrix, i.e.,

\[ x^{(i+1)}_j = x^{(i)}_j \oplus C_j[\cdot, k], \]

where $k$ is the position of the rightmost zero bit of $i$ (the position of the bit that
will change from index $i$ to $i+1$ in Gray code).
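A sketch of this Gray code update, reusing the packed-column representation from the previous sketch; the generating matrices are again toy identity matrices, so each coordinate reduces to a base-2 van der Corput sequence in Gray code order:

    def rightmost_zero_bit(i):
        """Position (0-based) of the rightmost zero bit of i."""
        k = 0
        while i & 1:
            i >>= 1
            k += 1
        return k

    def gray_code_points(n, gen_cols, m):
        """Generate n digital net points in Gray code ordering by XOR'ing one
        generating-matrix column per step (Antonov-Saleev iteration)."""
        d = len(gen_cols)
        state = [0] * d                     # packed digits of the current point
        pts = [[0.0] * d]
        for i in range(n - 1):
            k = rightmost_zero_bit(i)
            for j in range(d):
                state[j] ^= gen_cols[j][k]
            pts.append([sum(((y >> r) & 1) / 2.0 ** (r + 1) for r in range(m))
                        for y in state])
        return pts

    m = 4
    cols = [[1 << r for r in range(m)]] * 2   # two toy (identity) matrices
    for p in gray_code_points(8, cols, m):
        print(p)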
The digital net points can be randomized by adding a digital shift vector. If
$\boldsymbol{\Delta}$ is a $d$-dimensional vector of uniform random
numbers in $[0, 1)$, we construct the shifted points as
$\boldsymbol{x}^{(i)} \oplus \boldsymbol{\Delta}$, where $\oplus$ is the
element-wise $b$-ary addition (or XOR, in base 2) operator, see [DP10].
Note that the first untransformed point in the sequence will be
$\boldsymbol{x}^{(0)} = (0, 0, \ldots, 0)$.
Ideally, the digital net should preserve the structure of the points after
randomization. This can be achieved by scrambling the digital net. Scrambling
can also improve the rate of convergence of a method that uses these scrambled
points to compute the mean of the model response. Owen’s scrambling
[Owe98] is the most well-known scrambling technique. A particular
variant is linear matrix scrambling, see [Matouvsek98], which is
implemented in Dakota. In linear matrix scrambling, we left-multiply each
generating matrix with a lower-triangular random scramble matrix with 1s on the
diagonal, i.e.,
\[ \tilde{C}_j = L_j\, C_j, \qquad L_j = \begin{pmatrix} 1 & & & & \\ x & 1 & & & \\ x & x & 1 & & \\ x & x & x & 1 & \\ x & x & x & x & 1 \end{pmatrix}, \]

where the entries marked $x$ are chosen at random from $\{0, 1, \ldots, b-1\}$ (random bits in base 2).
Finally, for the digital net points to be practically useful, we would like to
transform the digital net into a digital sequence, which allows us to generate
well-distributed points for an arbitrary number of points $N$. One way to
do this is to generate the points in Gray code ordering, as discussed above.