.. _method-sampling-principal_components:

""""""""""""""""""""
principal_components
""""""""""""""""""""


Activates principal components analysis of the response matrix of N samples * L responses.


.. toctree::
   :hidden:
   :maxdepth: 1

   method-sampling-principal_components-percent_variance_explained


**Specification**

- *Alias:* None

- *Arguments:* None


**Child Keywords:**

+-------------------------+--------------------+--------------------------------+---------------------------------------------+
| Required/Optional       | Description of     | Dakota Keyword                 | Dakota Keyword Description                  |
|                         | Group              |                                |                                             |
+=========================+====================+================================+=============================================+
| Optional                                     | `percent_variance_explained`__ | Specifies the number of components to       |
|                                              |                                | retain to explain the specified percent     |
|                                              |                                | variance.                                   |
+----------------------------------------------+--------------------------------+---------------------------------------------+

.. __: method-sampling-principal_components-percent_variance_explained.html


**Description**

Dakota can calculate the principal components of the response matrix of
N samples * L responses using the keyword ``principal_components``.
Principal components analysis (PCA) is a data reduction method.
The Dakota implementation is under active development: the PCA capability
may ultimately be specified elsewhere or used in different ways.
For now, it is performed as a post-processing analysis based on a set of
Latin Hypercube samples.

We now have field responses in Dakota. PCA is an initial approach in Dakota
to analyze and represent the field data. Specifically, if we have a sample
ensemble of field data responses, we want to identify the principal
components responsible for the spread of that data.
Then, we can generate a surrogate model by representing the overall response
as a weighted sum of M principal components, where the weights are determined
by Gaussian process (GP) surrogates that are functions of the uncertain input
variables. This reduced form can then be used for sensitivity analysis,
calibration, etc.

The steps involved when one specifies ``principal_components`` in Dakota
are as follows:

- Create an LHS input sample based on the uncertain variable specification
  and run the user-specified model at the LHS points to compute the field
  responses. For notation purposes, there are d input parameters, N samples,
  and the field length is L.

- Perform PCA on the covariance matrix of the data set from the previous
  step. This is done by first centering the data (i.e., subtracting the mean
  of each column from that column) and performing a singular value
  decomposition on the covariance matrix of the centered data. The
  eigenvectors of the covariance matrix correspond to the principal
  components.

- Identify M principal components based on the percentage of variance
  explained. There is an optional keyword for ``principal_components``
  called ``percent_variance_explained``, which is a threshold that determines
  the number of components that are retained to explain at least that amount
  of variance. For example, if the user specifies
  ``percent_variance_explained`` = 0.99, the number of components that
  accounts for at least 99 percent of the variance in the responses will be
  retained. The default threshold is 0.95. In many applications, only a few
  principal components explain the majority of the variance, resulting in
  significant data reduction.

- Use the principal components in a predictive sense, by constructing a
  prediction approximation. The basis functions for this approximation are
  the principal components. The coefficients of the bases are obtained by
  constructing GP surrogates for the factor scores of the M principal
  components.


The GP surrogates will be functions of the uncertain inputs. The idea is that
we have just performed PCA on (for example) the covariance matrix of 100
samples. Typically, those 100 samples will be generated by sampling over some
d uncertain input parameters denoted by u, so there should be a mapping from
u to the field data, specifically to the loading coefficients and the factor
scores. Currently, the final item printed from a principal components
analysis in Dakota is a set of prediction samples based on this prediction
approximation, that is, the surrogate model that relies on the principal
components.

*Default Behavior*

``principal_components`` is turned off by default. It may be used with
either scalar responses or field responses, but it is intended to be used
with large field responses as a data reduction method. For example, we
typically expect the number of LHS samples, N, to be less than the number of
field responses, L (e.g., if there is one field, the number of response
values is the length of that field).

*Expected Outputs*

When ``principal_components`` is specified, the number of significant
principal components is printed along with the predictions based on the
principal components. If ``output`` ``debug`` is specified, additional
information is printed, including the original response matrix, the centered
data, the principal components, and the factor scores.

*Usage Tips*

This is a preliminary capability that is undergoing active development.
Please contact the Dakota development team if you have problems using this
capability or want to suggest additional features.


**Examples**

.. code-block::

    method,
      sampling
        sample_type lhs
        samples = 100
        principal_components
          percent_variance_explained = 0.98


**Theory**

There is an extensive statistical literature on PCA. We recommend that
interested users consult it when using the PCA capability.
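
The centering, decomposition, and component-selection steps described in this section can be sketched outside of Dakota as follows. This is only an illustrative NumPy sketch under stated assumptions, not Dakota's implementation: the function name ``retain_components`` and the synthetic field data are hypothetical, the SVD is applied to the centered data rather than to an explicitly formed covariance matrix (both give the same principal components), and the GP surrogate step for the factor scores is omitted.

```python
import numpy as np


def retain_components(responses, percent_variance_explained=0.95):
    """Illustrative PCA on an N x L response matrix (not Dakota's code).

    Returns the M retained principal components (M x L), the factor
    scores (N x M), and the cumulative fraction of variance explained.
    Assumes N > 1 and that the responses are not all identical.
    """
    data = np.asarray(responses, dtype=float)
    n_samples = data.shape[0]

    # Center each column (each response) on its sample mean.
    centered = data - data.mean(axis=0)

    # SVD of the centered data: the rows of vt are the eigenvectors of
    # the sample covariance matrix, i.e. the principal components.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

    # Eigenvalues of the covariance matrix: variance along each component.
    variances = singular_values**2 / (n_samples - 1)
    cumulative = np.cumsum(variances) / variances.sum()

    # Smallest M whose cumulative fraction reaches the threshold.
    m = int(np.searchsorted(cumulative, percent_variance_explained)) + 1
    m = min(m, len(cumulative))

    components = vt[:m]
    scores = centered @ components.T  # factor scores of the N samples
    return components, scores, cumulative


if __name__ == "__main__":
    # Hypothetical ensemble: N = 20 samples of one field of length L = 50.
    t = np.linspace(0.0, 1.0, 50)
    a = np.linspace(-1.0, 1.0, 20)
    b = np.cos(np.arange(20))
    field = (10.0 * np.outer(a, np.sin(2 * np.pi * t))
             + 0.1 * np.outer(b, np.cos(2 * np.pi * t)))

    components, scores, cumulative = retain_components(field, 0.95)
    print("retained components:", components.shape[0])
```

In a sketch like this, the field for a given sample is approximated by the column means plus ``scores @ components``. A predictive model in the spirit of the description above would replace each column of ``scores`` with a surrogate fitted against the uncertain inputs.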