principal_components

Activates principal components analysis of the response matrix of N samples * L responses.

Specification

Alias: None
Arguments: None

Child Keywords:

Required/Optional	Description of Group	Dakota Keyword	Dakota Keyword Description
Optional		percent_variance_explained	Specifies the number of components to retain to explain the specified percent variance.

Description

Dakota can calculate the principal components of the response matrix of N samples * L responses using the keyword principal_components. Principal components analysis (PCA) is a data reduction method. The Dakota implementation is under active development: the PCA capability may ultimately be specified elsewhere or used in different ways. For now, it is performed as a post-processing analysis based on a set of Latin Hypercube samples.

Dakota supports field responses, and PCA is an initial approach to analyzing and representing field data. Specifically, given a sample ensemble of field responses, we want to identify the principal components responsible for the spread of that data. We can then generate a surrogate model by representing the overall response as a weighted sum of M principal components, where the weights are determined by Gaussian process (GP) surrogates that are functions of the uncertain input variables. This reduced form can then be used for sensitivity analysis, calibration, etc.

The steps involved when one specifies principal_components in Dakota are as follows:

1. Create an LHS input sample based on the uncertain variable specification and run the user-specified model at the LHS points to compute the field responses. For notation purposes, there are d input parameters and N samples, and the field length is L.
2. Perform PCA on the covariance matrix of the data set from the previous step. This is done by first centering the data (i.e., subtracting the mean of each column from that column) and then performing a singular value decomposition on the covariance matrix of the centered data. The eigenvectors of the covariance matrix correspond to the principal components.
3. Identify M principal components based on the percentage of variance explained. The optional keyword percent_variance_explained sets a threshold: the smallest number of components that explains at least that fraction of the variance is retained. For example, if the user specifies percent_variance_explained = 0.99, the components that together account for at least 99 percent of the variance in the responses are retained. The default threshold is 0.95. In many applications, only a few principal components explain the majority of the variance, resulting in significant data reduction.
4. Use the principal components in a predictive sense by constructing a prediction approximation. The basis functions for this approximation are the principal components. The coefficients of the bases are obtained by constructing GP surrogates for the factor scores of the M principal components; these GP surrogates are functions of the uncertain inputs. The idea is that PCA has just been performed on the covariance matrix of (for example) 100 samples. Typically, those 100 samples are generated by sampling over d uncertain input parameters, denoted by u, so there should be a mapping from u to the field data, specifically to the loading coefficients and the factor scores. Currently, the final item printed from a principal components analysis in Dakota is a set of prediction samples based on this prediction approximation, i.e., the surrogate model that relies on the principal components.
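The centering, decomposition, and truncation in steps 2 and 3 can be sketched outside Dakota with NumPy. This is a minimal illustration under stated assumptions, not Dakota's implementation: the function name pca_retain, the synthetic two-mode field, and the sample sizes are all hypothetical, and the GP surrogates of step 4 are not shown.

```python
import numpy as np

def pca_retain(responses, percent_variance_explained=0.95):
    """PCA of an N x L response matrix (hypothetical helper, not a Dakota API).

    Keeps the smallest number of components whose cumulative variance
    fraction reaches the threshold.
    """
    responses = np.asarray(responses, dtype=float)
    mean = responses.mean(axis=0)
    centered = responses - mean                      # subtract each column mean
    cov = np.atleast_2d(np.cov(centered, rowvar=False))  # L x L covariance
    # The covariance matrix is symmetric positive semidefinite, so its SVD
    # yields the eigenvectors (columns of u) and eigenvalues (s).
    u, s, _ = np.linalg.svd(cov)
    total = s.sum()
    if total == 0.0:
        raise ValueError("responses have zero variance")
    explained = np.cumsum(s) / total                 # cumulative variance fraction
    m = int(np.searchsorted(explained, percent_variance_explained)) + 1
    m = min(m, len(s))                               # guard against round-off
    components = u[:, :m].T                          # M x L principal components
    scores = centered @ components.T                 # N x M factor scores
    return mean, components, scores, explained[:m]

# Synthetic stand-in for a field response: N = 100 samples, L = 50 field
# points, d = 2 uncertain inputs, two dominant modes plus small noise.
rng = np.random.default_rng(0)
u_inputs = rng.uniform(size=(100, 2))
x = np.linspace(0.0, 1.0, 50)
field = (u_inputs[:, :1] * np.sin(2 * np.pi * x)
         + u_inputs[:, 1:] * np.cos(2 * np.pi * x)
         + 1e-3 * rng.normal(size=(100, 50)))

mean, components, scores, explained = pca_retain(field, 0.99)
# Reconstruct the field as the mean plus a weighted sum of the components.
approx = mean + scores @ components
print(components.shape[0], "components retained")
```

In this sketch the factor scores are computed directly from the samples. In the workflow described above, a GP surrogate would instead predict each score as a function of the uncertain inputs u, and the same weighted sum would produce the predicted field.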

Default Behavior

principal_components is off by default. It may be used with either scalar responses or field responses, but it is intended as a data reduction method for large field responses. Typically, we expect the number of LHS samples, N, to be less than the number of field responses, L (e.g., if there is one field, the number of response values is the length of that field).
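One consequence of N being less than L is worth noting: after centering, the response matrix has rank at most N - 1, so no more than N - 1 principal components can carry nonzero variance. The short check below illustrates this with random data; the sizes and variable names are hypothetical and the code does not call Dakota.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, field_length = 20, 200          # hypothetical sizes with N < L
responses = rng.normal(size=(n_samples, field_length))

# Centering removes one degree of freedom from the N rows.
centered = responses - responses.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Count singular values that are nonzero relative to the largest one.
n_nonzero = int(np.sum(singular_values > 1e-10 * singular_values[0]))
print(n_nonzero, "components with nonzero variance; bound is", n_samples - 1)
```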

Expected Outputs

When principal_components is specified, the number of significant principal components is printed along with the predictions based on the principal components. If output debug is specified, additional information is printed, including the original response matrix, the centered data, the principal components, and the factor scores.

Usage Tips

This is a preliminary capability that is undergoing active development. Please contact the Dakota development team if you have problems using this capability or want to suggest additional features.

Examples

method,
  sampling
    sample_type lhs
    samples = 100
    principal_components
    percent_variance_explained = 0.98

Theory

There is an extensive statistical literature on PCA. We recommend that interested users consult some of it when using the PCA capability.
