Simulation Failure Capturing
Dakota provides the capability to manage failures in simulation codes within its system call, fork, and direct simulation interfaces (see Simulation Interfaces for simulation interface descriptions). Failure capturing consists of three operations: failure detection, failure communication, and failure mitigation.
Failure detection
Since the symptoms of a simulation failure are highly code and application dependent, it is the user’s responsibility to detect failures within their analysis_driver, input_filter, or output_filter. One popular example of simulation monitoring is to rely on a simulation’s internal detection of errors. In this case, the UNIX grep utility can be used within a user’s driver/filter script to detect strings in output files which indicate analysis failure. For example, the following simple C shell script excerpt
grep ERROR analysis.out > /dev/null
if ( $status == 0 ) then
  echo "FAIL" > results.out
endif
will pass the if test and communicate simulation failure to Dakota if the grep command finds the string ERROR anywhere in the analysis.out file. The /dev/null device file is called the “bit bucket”; the grep command output is discarded by redirecting it to this destination. The $status shell variable contains the exit status of the last command executed [AA86], which is the exit status of grep in this case (0 if successful in finding the error string, nonzero otherwise). For Bourne shells [Bli96], the $? shell variable serves the same purpose as $status does for C shells. In a related approach, if the return code from a simulation can be used directly for failure detection purposes, then $status or $? could be queried immediately following the simulation execution using an if test like that shown above.
If the simulation code is not returning error codes or providing direct error diagnostic information, then failure detection may require monitoring of simulation results for sanity (e.g., is the mesh distorting excessively?) or potentially monitoring for continued process existence to detect a simulation segmentation fault or core dump. While this can get complicated, the flexibility of Dakota’s interfaces allows for a wide variety of user-defined monitoring approaches.
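For scripted drivers, these checks can be sketched in a Python driver function (a minimal sketch: the command, the analysis.out and results.out file names, and the ERROR string are hypothetical stand-ins; a real driver would use the simulation’s actual diagnostics):

```python
import subprocess

def run_and_detect(command, output_file="analysis.out", results_file="results.out"):
    """Run a simulation command and flag failure via the results file."""
    # A nonzero return code or an ERROR string in the output file
    # both count as failure in this sketch.
    proc = subprocess.run(command, shell=True)
    failed = proc.returncode != 0
    try:
        with open(output_file) as f:
            failed = failed or "ERROR" in f.read()
    except FileNotFoundError:
        failed = True  # no output at all: treat as a crash
    if failed:
        with open(results_file, "w") as f:
            f.write("FAIL\n")
    return failed
```

The sanity checks mentioned above (e.g., mesh-distortion monitoring) would slot into the same function in place of the simple string search.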
Failure communication
Once a failure is detected, it must be communicated so that Dakota can take the appropriate corrective action. The form of this communication depends on the type of simulation interface in use.
In the system call and fork simulation interfaces, a detected simulation failure is communicated to Dakota through the results file. When using the standard results file format, the string “fail” should appear at the beginning of the results file. Any data appearing after the fail string will be ignored. Also, Dakota’s detection of this string is case insensitive, so “FAIL”, “Fail”, etc., are equally valid. For JSON, failure is communicated to Dakota by including the name:value pair "fail": "true" in the evaluation object. Both the name and the value must be lowercase.
In the direct simulation interface case, a detected simulation failure is communicated to Dakota through the return code provided by the user’s analysis_driver, input_filter, or output_filter. As shown in Extension, the prototype for simulations linked within the direct interface includes an integer return code. This code has the following meanings: zero (false) indicates that all is normal, and nonzero (true) indicates an exception (i.e., a simulation failure).
Failure mitigation
Once the analysis failure has been communicated, Dakota will attempt to
recover from the failure using one of the following four mechanisms, as
governed by the interface
specification in the user’s input file.
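In the input file, the mechanism is selected within the failure_capture keyword group of the interface block; a sketch follows (the driver name is hypothetical, and the exact keyword spellings should be checked against the keyword reference):

```
interface
  fork
    analysis_drivers = 'driver.sh'
  failure_capture
    retry = 3
```

Replacing the retry specification with abort, recover (followed by a list of dummy function values), or continuation selects one of the other mechanisms described below.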
Abort (default)
If the abort
option is active (the default), then Dakota will
terminate upon detecting a failure. Note that if the problem causing the
failure can be corrected, Dakota’s restart capability (see
The Dakota Restart Utility) can be used to continue the study.
Retry
If the retry
option is specified, then Dakota will re-invoke the
failed simulation up to the specified number of retries. If the
simulation continues to fail on each of these retries, Dakota will
terminate. The retry option is appropriate for those cases in which
simulation failures may result from transient computing
environment issues, such as shared disk space, software license access,
or networking problems.
Recover
If the recover
option is specified, then Dakota will not attempt the
failed simulation again. Rather, it will return a “dummy” set of
function values as the results of the function evaluation. The dummy
function values to be returned are specified by the user. Any gradient
or Hessian data requested in the active set vector will be zero. This
option is appropriate for those cases in which a failed simulation may indicate a region of the design space to be avoided; the dummy values can then return a large objective function or constraint violation that discourages an optimizer from further investigating the region.
Continuation
If the continuation
option is specified, then Dakota will attempt to
step towards the failing “target” simulation from a nearby “source”
simulation through the use of a continuation algorithm. This option is
appropriate for those cases in which a failed simulation may be caused
by an inadequate initial guess. If the “distance” between the source and
target can be divided into smaller steps in which information from one
step provides an adequate initial guess for the next step, then the
continuation method can step towards the target in increments
sufficiently small to allow for convergence of the simulations.
When the failure occurs, the interval between the last successful evaluation (the source point) and the current target point is halved and the evaluation is retried. This halving is repeated until a successful evaluation occurs. The algorithm then marches towards the target point using the last interval as a step size. If a failure occurs while marching forward, the interval will be halved again. Each invocation of the continuation algorithm is allowed a total of ten failures (ten halvings result in up to 1024 evaluations from source to target) prior to aborting the Dakota process.
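The halving logic can be sketched in Python for a one-dimensional parameter (a sketch only: the evaluator below is a hypothetical stand-in that converges only when its initial guess, taken from the previous success, is close to the requested point):

```python
def continuation(source, target, evaluate, max_failures=10):
    """March from a successful source point toward a failing target,
    halving the interval whenever an evaluation fails."""
    current = source
    step = target - source              # initial interval: the full distance
    failures = 0
    successes = []
    while current != target:
        trial = current + step
        if abs(trial - source) > abs(target - source):
            trial = target              # do not overshoot the target
        if evaluate(trial):
            successes.append(trial)
            current = trial             # success: keep marching with this step
        else:
            failures += 1
            if failures > max_failures:
                raise RuntimeError("continuation aborted after repeated failures")
            step /= 2                   # failure: halve the interval and retry
    return successes

class FussySimulation:
    """Converges only when the previous solution is a nearby initial guess."""
    def __init__(self, reach=0.3):
        self.last, self.reach = 0.0, reach
    def __call__(self, x):
        ok = abs(x - self.last) <= self.reach
        if ok:
            self.last = x               # a success updates the initial guess
        return ok
```

For a source of 0.0 and a target of 1.0 with this evaluator, the first two trials (1.0 and 0.5) fail, after which the march succeeds in steps of 0.25.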
While Dakota manages the interval halving and function evaluation invocations, the user is responsible for managing the initial guess for the simulation program. For example, in a GOMA input file [SSR+95], the user specifies the files to be used for reading initial guess data and writing solution data. When using the last successful evaluation in the continuation algorithm, the translation of initial guess data can be accomplished by simply copying the solution data file leftover from the last evaluation to the initial guess file for the current evaluation (and in fact this is useful for all evaluations, not just continuation). However, a more general approach would use the closest successful evaluation (rather than the last successful evaluation) as the source point in the continuation algorithm. This will be especially important for nonlocal methods (e.g., genetic algorithms) in which the last successful evaluation may not necessarily be in the vicinity of the current evaluation. This approach will require the user to save and manipulate previous solutions (likely tagged with evaluation number) so that the results from a particular simulation (specified by Dakota after internal identification of the closest point) can be used as the current simulation’s initial guess. This more general approach is not yet supported in Dakota.
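The file shuffle described above can be sketched directly (the file names are hypothetical placeholders for a GOMA-style solution file and initial-guess file):

```python
import shutil

def stage_initial_guess(prev_solution="solution.dat", guess_file="guess.dat"):
    # Reuse the solution left over from the last successful evaluation
    # as the initial guess for the current evaluation.
    shutil.copyfile(prev_solution, guess_file)
```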
Special values
In IEEE arithmetic, “NaN” indicates “not a number” and ±“Inf” or ±“Infinity” indicates positive or negative infinity. These special values may be returned directly in function evaluation results from a simulation interface, or they may be specified in a user’s input file within the recover specification described in Recover. There is a key difference between these two cases. In the former case of direct simulation return, failure mitigation can be managed on a per response function basis. When using recover, however, the failure applies to the complete set of simulation results.
In both of these cases, the handling of NaN or Inf is managed using iterator-specific approaches. Currently, the only methods with special numerical exception handling are the nondeterministic sampling methods (see Sampling Methods), polynomial chaos expansions using either regression approaches or spectral projection with random sampling (see Stochastic Expansion Methods), and the NL2SOL method for nonlinear least squares (see NL2SOL). The sampling methods simply omit any samples that are not finite from the statistics generation, and the polynomial chaos methods omit any samples that are not finite from the coefficient estimation. NL2SOL treats NaN or Infinity in a residual vector (i.e., values in a results file for a function evaluation) computed for a trial step as an indication that the trial step was too long and violates an unstated constraint; NL2SOL responds by trying a shorter step.
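The sampling methods’ omission of non-finite samples can be sketched as a filter applied before statistics generation (a minimal illustration of the idea, not Dakota’s implementation):

```python
import math

def finite_statistics(samples):
    """Mean and count over only the finite samples; NaN and +/-Inf
    are omitted, mirroring the treatment described above."""
    finite = [s for s in samples if math.isfinite(s)]
    if not finite:
        raise ValueError("no finite samples to summarize")
    return sum(finite) / len(finite), len(finite)
```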