Simulation Failure Capturing

Dakota provides the capability to manage failures in simulation codes within its system call, fork, and direct simulation interfaces (see Simulation Interfaces for simulation interface descriptions). Failure capturing consists of three operations: failure detection, failure communication, and failure mitigation.

Failure detection

Since the symptoms of a simulation failure are highly code and application dependent, it is the user’s responsibility to detect failures within their analysis_driver, input_filter, or output_filter. One popular example of simulation monitoring is to rely on a simulation’s internal detection of errors. In this case, the UNIX grep utility can be used within a user’s driver/filter script to detect strings in output files which indicate analysis failure. For example, the following simple C shell script excerpt

grep ERROR analysis.out > /dev/null
if ( $status == 0 )
  echo "FAIL" > results.out
endif

will pass the if test and communicate simulation failure to Dakota if the grep command finds the string ERROR anywhere in the analysis.out file. The /dev/null device file is called the “bit bucket” and the grep command output is discarded by redirecting it to this destination. The $status shell variable contains the exit status of the last command executed [AA86], which is the exit status of grep in this case (0 if successful in finding the error string, nonzero otherwise). For Bourne shells [Bli96], the $? shell variable serves the same purpose as $status for C shells. In a related approach, if the return code from a simulation can be used directly for failure detection purposes, then $status or $? could be queried immediately following the simulation execution using an if test like that shown above.

If the simulation code is not returning error codes or providing direct error diagnostic information, then failure detection may require monitoring of simulation results for sanity (e.g., is the mesh distorting excessively?) or potentially monitoring for continued process existence to detect a simulation segmentation fault or core dump. While this can get complicated, the flexibility of Dakota’s interfaces allows for a wide variety of user-defined monitoring approaches.

Failure communication

Once a failure is detected, it must be communicated so that Dakota can take the appropriate corrective action. The form of this communication depends on the type of simulation interface in use.

In the system call and fork simulation interfaces, a detected simulation failure is communicated to Dakota through the results file. When using the standard results file format, the string “fail” should appear at the beginning of the results file. Any data appearing after the fail string will be ignored. Also, Dakota’s detection of this string is case insensitive, so “FAIL”, “Fail”, etc., are equally valid. For JSON, failure is communicated to Dakota by including the name:value pair "fail": "true" in the evaluation object. Both the name and value must be lowercase.

In the direct simulation interface case, a detected simulation failure is communicated to Dakota through the return code provided by the user’s analysis_driver, input_filter, or output_filter. As shown in Extension, the prototype for simulations linked within the direct interface includes an integer return code. This code has the following meanings: zero (false) indicates that all is normal and nonzero (true) indicates an exception (i.e., a simulation failure).

Failure mitigation

Once the analysis failure has been communicated, Dakota will attempt to recover from the failure using one of the following four mechanisms, as governed by the interface specification in the user’s input file.

Abort (default)

If the abort option is active (the default), then Dakota will terminate upon detecting a failure. Note that if the problem causing the failure can be corrected, Dakota’s restart capability (see The Dakota Restart Utility) can be used to continue the study.

Retry

If the retry option is specified, then Dakota will re-invoke the failed simulation up to the specified number of retries. If the simulation continues to fail on each of these retries, Dakota will terminate. The retry option is appropriate for those cases in which simulation failures may be resulting from transient computing environment issues, such as shared disk space, software license access, or networking problems.

Recover

If the recover option is specified, then Dakota will not attempt the failed simulation again. Rather, it will return a “dummy” set of function values as the results of the function evaluation. The dummy function values to be returned are specified by the user. Any gradient or Hessian data requested in the active set vector will be zero. This option is appropriate for those cases in which a failed simulation may indicate a region of the design space to be avoided and the dummy values can be used to return a large objective function or constraint violation which will discourage an optimizer from further investigating the region.

Continuation

If the continuation option is specified, then Dakota will attempt to step towards the failing “target” simulation from a nearby “source” simulation through the use of a continuation algorithm. This option is appropriate for those cases in which a failed simulation may be caused by an inadequate initial guess. If the “distance” between the source and target can be divided into smaller steps in which information from one step provides an adequate initial guess for the next step, then the continuation method can step towards the target in increments sufficiently small to allow for convergence of the simulations.

When the failure occurs, the interval between the last successful evaluation (the source point) and the current target point is halved and the evaluation is retried. This halving is repeated until a successful evaluation occurs. The algorithm then marches towards the target point using the last interval as a step size. If a failure occurs while marching forward, the interval will be halved again. Each invocation of the continuation algorithm is allowed a total of ten failures (ten halvings result in up to 1024 evaluations from source to target) prior to aborting the Dakota process.

While Dakota manages the interval halving and function evaluation invocations, the user is responsible for managing the initial guess for the simulation program. For example, in a GOMA input file [SSR+95], the user specifies the files to be used for reading initial guess data and writing solution data. When using the last successful evaluation in the continuation algorithm, the translation of initial guess data can be accomplished by simply copying the solution data file leftover from the last evaluation to the initial guess file for the current evaluation (and in fact this is useful for all evaluations, not just continuation). However, a more general approach would use the closest successful evaluation (rather than the last successful evaluation) as the source point in the continuation algorithm. This will be especially important for nonlocal methods (e.g., genetic algorithms) in which the last successful evaluation may not necessarily be in the vicinity of the current evaluation. This approach will require the user to save and manipulate previous solutions (likely tagged with evaluation number) so that the results from a particular simulation (specified by Dakota after internal identification of the closest point) can be used as the current simulation’s initial guess. This more general approach is not yet supported in Dakota.

Special values

In IEEE arithmetic, “NaN” indicates “not a number” and $\pm$ “Inf” or $\pm$ “Infinity” indicates positive or negative infinity. These special values may be returned directly in function evaluation results from a simulation interface or they may be specified in a user’s input file within the recover specification described in Recover. There is a key difference between these two cases. In the former case of direct simulation return, failure mitigation can be managed on a per response function basis. When using recover, however, the failure applies to the complete set of simulation results.

In both of these cases, the handling of NaN or Inf is managed using iterator-specific approaches. Currently, nondeterministic sampling methods (see Sampling Methods), polynomial chaos expansions using either regression approaches or spectral projection with random sampling (see Stochastic Expansion Methods), and the NL2SOL method for nonlinear least squares (see NL2SOL) are the only methods with special numerical exception handling: the sampling methods simply omit any samples that are not finite from the statistics generation, the polynomial chaos methods omit any samples that are not finite from the coefficient estimation, and NL2SOL treats NaN or Infinity in a residual vector (i.e., values in a results file for a function evaluation) computed for a trial step as an indication that the trial step was too long and violates an unstated constraint; NL2SOL responds by trying a shorter step.