# Nonlinear Least Squares

## Overview

Any Dakota optimization algorithm can be applied to calibration problems arising in parameter estimation, system identification, and test/analysis reconciliation. However, nonlinear least-squares methods are optimization algorithms that exploit the special structure of a sum of the squares objective function [GMW81].

To exploit the problem structure, more granularity is needed in the
response data than is required for a typical optimization problem. That
is, rather than using the sum-of-squares objective function and its
gradient, least-squares iterators require each term used in the
sum-of-squares formulation along with its gradient. This means that the
\(m\) functions in the Dakota response data set consist of the
individual least-squares terms along with any nonlinear inequality and
equality constraints. These individual terms are often called
*residuals* when they denote differences of observed quantities from
values computed by the model whose parameters are being estimated.

The enhanced granularity needed for nonlinear least-squares algorithms allows for simplified computation of an approximate Hessian matrix. In Gauss-Newton-based methods for example, the true Hessian matrix is approximated by neglecting terms in which residuals multiply Hessians (matrices of second partial derivatives) of residuals, under the assumption that the residuals tend towards zero at the solution. As a result, residual function value and gradient information (first-order information) is sufficient to define the value, gradient, and approximate Hessian of the sum-of-squares objective function (second-order information). See formulations for additional details on this approximation.

In practice, least-squares solvers will tend to be significantly more efficient than general-purpose optimization algorithms when the Hessian approximation is a good one, e.g., when the residuals tend towards zero at the solution. Specifically, they can exhibit the quadratic convergence rates of full Newton methods, even though only first-order information is used. Gauss-Newton-based least-squares solvers may experience difficulty when the residuals at the solution are significant. Dakota has three solvers customized to take advantage of the sum of squared residuals structure in this problem formulation. Least squares solvers may experience difficulty when the residuals at the solution are significant, although experience has shown that Dakota’s NL2SOL method can handle some problems that are highly nonlinear and have nonzero residuals at the solution.

## Nonlinear Least Squares Fomulations

Specialized least squares solution algorithms can exploit the structure of a sum of the squares objective function for problems of the form:

where \(f(\mathbf{x})\) is the objective function to be minimized and \(T_i(\mathbf{x})\) is the i\(^{\mathrm{th}}\) least squares term. The bound, linear, and nonlinear constraints are the same as described previously for (35), Optimization Formulations. Specialized least squares algorithms are generally based on the Gauss-Newton approximation. When differentiating \(f(\mathbf{x})\) twice, terms of \(T_i(\mathbf{x})T''_i(\mathbf{x})\) and \([T'_i(\mathbf{x})]^{2}\) result. By assuming that the former term tends toward zero near the solution since \(T_i(\mathbf{x})\) tends toward zero, then the Hessian matrix of second derivatives of \(f(\mathbf{x})\) can be approximated using only first derivatives of \(T_i(\mathbf{x})\). As a result, Gauss-Newton algorithms exhibit quadratic convergence rates near the solution for those cases when the Hessian approximation is accurate, i.e. the residuals tend towards zero at the solution. Thus, by exploiting the structure of the problem, the second order convergence characteristics of a full Newton algorithm can be obtained using only first order information from the least squares terms.

A common example for \(T_i(\mathbf{x})\) might be the difference between experimental data and model predictions for a response quantity at a particular location and/or time step, i.e.:

where \(R_i(\mathbf{x})\) is the response quantity predicted by the model and \(\bar{R_i}\) is the corresponding experimental data. In this case, \(\mathbf{x}\) would have the meaning of model parameters which are not precisely known and are being calibrated to match available data. This class of problem is known by the terms parameter estimation, system identification, model calibration, test/analysis reconciliation, etc.

## Nonlinear Least Squares with Dakota

In order to specify a least-squares problem, the responses section of
the Dakota input should be configured using `calibration_terms`

(as
opposed to `objective_functions`

as for optimization). The calibration
terms refer to the residuals (differences between the simulation model
and the data). Note that Dakota expects the residuals, not the squared
residuals, and offers options for instead returning the simulation
output to Dakota together with a separate `calibration_data`

file,
from which residuals will be calculated. Any linear or nonlinear
constraints are handled in an identical way to that of optimization (see
Section Optimization Formulations (35) ; note that neither
Gauss-Newton nor NLSSOL require any constraint augmentation and NL2SOL
supports neither linear nor nonlinear constraints). Gradients of the
least-squares terms and nonlinear constraints are required and should be
specified using either `numerical_gradients`

, `analytic_gradients`

,
or `mixed_gradients`

. Since explicit second derivatives are not used
by the least-squares methods, the `no_hessians`

specification should
be used. Dakota’s scaling options, described in
Section Optimization with User-specified or Automatic Scaling, Listing 53 can be
used on least-squares problems, using the `calibration_term_scales`

keyword to scale least-squares residuals, if desired.

## Solution Techniques

Nonlinear least-squares problems can be solved using the Gauss-Newton algorithm, which leverages the full Newton method from OPT++, the NLSSOL algorithm, which is closely related to NPSOL, or the NL2SOL algorithm, which uses a secant-based algorithm. Details for each are provided below.

### Gauss-Newton

Dakota’s Gauss-Newton algorithm consists of combining an implementation of the Gauss-Newton Hessian approximation (see Section Nonlinear Least Squares Fomulations) with full Newton optimization algorithms from the OPT++ package [MOHW07] (see Section Methods for Constrained Problems). The exact objective function value, exact objective function gradient, and the approximate objective function Hessian are defined from the least squares term values and gradients and are passed to the full-Newton optimizer from the OPT++ software package. As for all of the Newton-based optimization algorithms in OPT++, unconstrained, bound-constrained, and generally-constrained problems are supported. However, for the generally-constrained case, a derivative order mismatch exists in that the nonlinear interior point full Newton algorithm will require second-order information for the nonlinear constraints whereas the Gauss-Newton approximation only requires first order information for the least squares terms. License: LGPL.

This approach can be selected using the `optpp_g_newton`

method
specification. An example specification follows:

```
method,
optpp_g_newton
max_iterations = 50
convergence_tolerance = 1e-4
output debug
```

Refer to the Dakota Reference Manual Keyword Reference for more detail on the input commands for the Gauss-Newton algorithm.

The Gauss-Newton algorithm is gradient-based and is best suited for efficient navigation to a local least-squares solution in the vicinity of the initial point. Global optima in multimodal design spaces may be missed. Gauss-Newton supports bound, linear, and nonlinear constraints. For the nonlinearly-constrained case, constraint Hessians (required for full-Newton nonlinear interior point optimization algorithms) are approximated using quasi-Newton secant updates. Thus, both the objective and constraint Hessians are approximated using first-order information.

### NLSSOL

The NLSSOL algorithm is bundled with NPSOL. It uses an SQP-based approach to solve generally-constrained nonlinear least-squares problems. It periodically employs the Gauss-Newton Hessian approximation to accelerate the search. Like the Gauss-Newton algorithm of Section Gauss-Newton), its derivative order is balanced in that it requires only first-order information for the least-squares terms and nonlinear constraints. License: commercial; see NPSOL Methods for Constrained Problems.

This approach can be selected using the `nlssol_sqp`

method
specification. An example specification follows:

```
method,
nlssol_sqp
convergence_tolerance = 1e-8
```

Refer to the Dakota Reference Manual Keyword Reference for more detail on the input commands for NLSSOL.

### NL2SOL

The NL2SOL algorithm [DGW81] is a secant-based least-squares algorithm that is \(q\)-superlinearly convergent. It adaptively chooses between the Gauss-Newton Hessian approximation and this approximation augmented by a correction term from a secant update. NL2SOL tends to be more robust (than conventional Gauss-Newton approaches) for nonlinear functions and “large residual” problems, i.e., least-squares problems for which the residuals do not tend towards zero at the solution. License: publicly available.

### Additional Features

Dakota’s tailored derivative-based least squares solvers (but not general optimization solvers) output confidence intervals on estimated parameters. The reported confidence intervals are univariate (per-parameter), based on local linearization, and will contain the true value of the parameters with 95% confidence. Their calculation essentially follows the exposition in [SW03] and is summarized here.

Denote the variance estimate at the optimal calibrated parameters \(\hat{x}\) by

where \(T_i\) are the least squares terms (typically residuals) discussed above and \(N_{dof} = n - p\) denotes the number of degrees of freedom (total residuals \(n\) less the number of calibrated parameters \(p\)). Let

denote the \(n \times p\) matrix of partial derivatives of the residuals with respect to the calibrated paramters. Then the standard error \(SE_i\) for calibrated parameter \(x_i\) is given by

Using a Student’s t-distribution with \(N_{dof}\) degrees of freedom, the 95% confidence interval for each parameter is given by

In the case where estimated gradients are extremely inaccurate or the model is very nonlinear, the confidence intervals reported are likely inaccurate as well. Further, confidence intervals cannot be calculated when the number of least-squares terms is less than the number of parameters to be estimated, when using vendor numerical gradients, or where there are replicate experiments. See [VSR+07] for more details about confidence intervals, and note that there are alternative approaches such as Bonferroni confidence intervals and joint confidence intervals based on linear approximations or F-tests.

Least squares calibration terms (responses) can be weighted. When
observation error variance is provided alongside calibration data, its
inverse is applied to yield the typical variance-weighted least squares
formulation. Alternately, the `calibration_terms weights`

specification can be used to weight the squared residuals. (Neither set
of weights are adjusted during calibration as they would be in
iteratively re-weighted least squares.) When response scaling is active,
it is applied after error variance weighting and before `weights`

application. The `calibration_terms`

keyword documentation in the
Dakota Reference Manual Keyword Reference has more detail about
weighting and scaling of the residual terms.

## Examples

Both the Rosenbrock and textbook example problems can be formulated as nonlinear least-squares problems. Refer to Additional Examples for more information on these formulations.

Figure Listing 54 shows an excerpt from the output obtained when running NL2SOL on a five-dimensional problem. Note that the optimal parameter estimates are printed, followed by the residual norm and values of the individual residual terms, followed by the confidence intervals on the parameters.

```
------------------------------
<<<<< Iterator nl2sol completed.
<<<<< Function evaluation summary: 27 total (26 new, 1 duplicate)
<<<<< Best parameters =
3.7541004764e-01 x1
1.9358463401e+00 x2
-1.4646865611e+00 x3
1.2867533504e-02 x4
2.2122702030e-02 x5
<<<<< Best residual norm = 7.3924926090e-03; 0.5 * norm^2 = 2.7324473487e-05
<<<<< Best residual terms =
-2.5698266189e-03
4.4759880011e-03
9.9223430643e-04
-1.0634409194e-03
...
Confidence Interval for x1 is [ 3.7116510206e-01, 3.7965499323e-01 ]
Confidence Interval for x2 is [ 1.4845485507e+00, 2.3871441295e+00 ]
Confidence Interval for x3 is [ -1.9189348458e+00, -1.0104382765e+00 ]
Confidence Interval for x4 is [ 1.1948590669e-02, 1.3786476338e-02 ]
Confidence Interval for x5 is [ 2.0289951664e-02, 2.3955452397e-02 ]
```

The analysis driver script (the script being driven by Dakota) has to
perform several tasks in the case of parameter estimation using
nonlinear least-squares methods. The analysis driver script must: (1)
read in the values of the parameters supplied by Dakota; (2) run the
computer simulation with these parameter values; (3) retrieve the
results from the computer simulation; (4) compute the difference between
each computed simulation value and the corresponding experimental or
measured value; and (5) write these residuals (differences) to an
external file that gets passed back to Dakota. Note there will be one
line per residual term, specified with `calibration_terms`

in the
Dakota input file. It is the last two steps which are different from
most other Dakota applications.

To simplify specifying a least squares problem, one may provide Dakota a
data file containing experimental results or other calibration data. In
the case of scalar calibration terms, this file may be specified with .
In this case, Dakota will calculate the residuals (that is, the
simulation model results minus the experimental results), and the
user-provided script can omit this step: the script can just return the
simulation outputs of interest. An example of this can be found in the
file named `dakota/share/dakota/examples/users/textbook_nls_datafile.in`

.
In this example, there are 3 residual terms. The data file
of experimental results associated with this example is
`textbook_nls_datafile.lsq.dat`

. These three
values are subtracted from the least-squares terms to produce residuals
for the nonlinear least-squares problem. Note that the file may be
annotated (specified by `annotated`

) or freeform (specified by
`freeform`

). The number of experiments in the calibration data file
may be specified with , with one row of data per experiment. When
multiple experiments are present, the total number of least squares
terms will be the number of calibration terms times the number of
experiments.

Finally, the calibration data file may contain additional information
than just the observed experimental responses. If the observed data has
measurement error associated with it, this can be specified in columns
of such error data after the response data. The type of measurement
error is specified by `variance_type`

. For scalar calibration terms,
the `variance_type`

can be either `none`

(the user does not specify
a measurement variance associated with each calibration term) or
`scalar`

(the user specifies one measurement variance per calibration
term). For field calibration terms, the `variance_type`

can also be
`diagonal`

or `matrix`

. These are explained in more detail in the
Reference manual. See the Keyword Reference
for more information. Additionally, there is sometimes the need to specify
configuration variables. These are often used in Bayesian calibration
analysis. These are specified as `num_config_variables`

. If the user
specifies a positive number of configuration variables, it is expected
that they will occur in the text file before the responses.

## Usage Guidelines

Calibration problems can be transformed to general optimization problems where the objective is some type of aggregated error metric. For example, the objective could be the sum of squared error terms. However, it also could be the mean of the absolute value of the error terms, the maximum difference between the simulation results and observational results, etc. In all of these cases, one can pose the calibration problem as an optimization problem that can be solved by any of Dakota’s optimizers. In this situation, when applying an general optimization solver to a calibration problem, the guidelines in Guide Table 12 still apply.

In some cases, it will be better to use a nonlinear least-squares method instead of a general optimizer to determine optimal parameter values which result in simulation responses that “best fit” the observational data. Nonlinear least squares methods exploit the special structure of a sum of the squares objective function. They can be much more efficient than general optimizers. However, these methods require the gradients of the function with respect to the parameters being calibrated. If the model is not able to produce gradients, one can use finite differencing to obtain gradients. However, the gradients must be reasonably accurate for the method to proceed. Note that the nonlinear least-squares methods only operate on a sum of squared errors as the objective. Also, the user must return each residual term separately to Dakota, whereas the user can return an aggregated error measure in the case of general optimizers.

The three nonlinear least-squares methods are the Gauss-Newton method in OPT++, NLSSOL, and NL2SOL. Any of these may be tried; they give similar performance on many problems. NL2SOL tends to be more robust than Gauss-Newton, especially for nonlinear functions and large-residual problems where one is not able to drive the residuals to zero at the solution. NLSSOL does require that the user has the NPSOL library. Note that all of these methods are local in the sense that they are gradient-based and depend on an initial starting point. Often they are used in conjunction with a multi-start method, to perform several repetitions of the optimization at different starting points in the parameter space. Another approach is to use a general global optimizer such as a genetic algorithm or DIRECT as mentioned above. This can be much more expensive, however, in terms of the number of function evaluations required.