PCA Models

If care is not taken about the way j is obtained, SIMCA has a tendency to exclude more objects from the training class than necessary. The 5-value should be determined by cross-validation. Each object in the training set is then predicted using the A-dimensional PCA model obtained for the other (n - 1) training set objects. The (residual) scores obtained in this way for each object are used in eq. (33.14) [30]. [Pg.230]

PLS was originally proposed by Herman Wold (Wold, 1982; Wold et al., 1984) to address situations involving a modest number of observations, highly collinear variables, and data with noise in both the X- and Y-data sets. It is therefore designed to analyze the variations between two data sets (X, Y). Although PLS is similar to PCA in that they both model the X-data variance, the resulting X-space model in PLS is a rotated version of the PCA model. The rotation is defined so that the scores of the X-data maximize their covariance with the Y-data, and thus predict the Y-data. [Pg.36]

Fig. 35. Scores of Principal Component 1 vs. scores of Principal Component 2 for the PCA model.

Such a result appears to be of major interest given that neither a classification of compounds nor any training information was applied to the PCA model. A more detailed inspection of the score plot in Fig. 17.5 indicates that some compounds are misclassified, although experimental evaluation of these compounds revealed problems with their chemical stability or solubility. Thus, it appears that this model can be used to evaluate false-positive (or false-negative) experiments. Moreover, it can also be used to evaluate the metabolic stability from the 3D structure of a drug candidate prior to experimental measurements. [Pg.418]

The PCA model gives a representation of the centered (and scaled) data matrix... [Pg.91]

Outliers may heavily influence the result of PCA. Diagnostic plots help to find outliers (leverage points and orthogonal outliers) falling outside the hyper-ellipsoid which defines the PCA model. It is essential to use robust methods that tolerate deviations from multivariate normal distributions. [Pg.114]

FIGURE 5.11 The Coomans plot uses the distances of the objects to the PCA models of two groups. It visualizes whether objects belong to one of the groups, to both, or to none. [Pg.226]

We come back to the problem of selecting the optimum dimensions a1, ..., ak of the PCA models. This can be done with an appropriate evaluation technique like CV, and the goal is to minimize the total probability of misclassification. The latter can be obtained from the evaluation set by computing the percentage of misclassified objects in each group, multiplying it by the relative group size, and summing over all groups. [Pg.226]
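The total probability of misclassification described above can be sketched as follows. This is a minimal illustration, not the book's own code: the function name `total_misclassification` and the toy labels are hypothetical, and the relative group sizes are assumed to be estimated from the evaluation set itself (prior probabilities from another source could be substituted).

```python
import numpy as np

def total_misclassification(y_true, y_pred):
    """Sum over groups of (fraction misclassified in group) * (relative group size).

    Assumption: relative group sizes are taken from the evaluation set itself.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for g in np.unique(y_true):
        in_group = y_true == g
        rel_size = in_group.mean()                  # relative group size
        err_rate = (y_pred[in_group] != g).mean()   # misclassified fraction in group g
        total += err_rate * rel_size
    return total

# Toy evaluation set (hypothetical labels): 4 objects in group 0, 2 in group 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]
rate = total_misclassification(y_true, y_pred)
```

In a CV loop, this quantity would be computed for each candidate combination of dimensions, and the combination with the smallest value retained.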

A number of methods model the object groups by PCA models of appropriate complexity (SIMCA, Section 5.3.1) or by Gaussian functions (Section 5.3.2). These methods are successful if the groups form compact clusters, they can handle... [Pg.260]

The main classification methods for drug development are discriminant analysis (DA), possibly based on principal components (PLS-DA), and soft independent modeling of class analogy (SIMCA). SIMCA is based only on PCA: one PCA model is created for each class, and the distances between objects and the projection space of each PCA model are evaluated. PLS-DA is, for example, applied for the prediction of adverse effects by nonsteroidal anti-... [Pg.63]

In practice, the choice of an optimal number of PCs to retain in the PCA model (A) is a rather subjective process, which balances the need to explain as much of the original data as possible with the need to avoid incorporating too much noise into the PCA model (overfitting). The issue of overfitting is discussed later in Section 12.4. [Pg.363]

As the above example illustrates, PCA can be an effective exploratory tool. However, it can also be used as a predictive tool in a PAT context. A good example of this usage is the case where one wishes to determine whether newly collected analyzer responses are normal or abnormal with respect to previously collected responses. An efficient way to perform such analyses would be to construct a PCA model using the previously collected responses, and apply this model to any analyzer response (Xp) generated by a subsequently collected sample. Such PCA model application first involves a multiplication of the response vector with the PCA loadings (P) to generate a set of PCA scores for the newly collected response ... [Pg.365]

In linear algebra, this operation is called a projection of the response onto the space of the PCA model. Once this is done, the PCA-model-estimated response for the sample (xp) can be calculated ... [Pg.365]

This then allows calculation of the residual for the new sample (cp), which is simply the difference between the PCA-model-estimated response and the actual response ... [Pg.365]

Conceptually, the residual is the portion of the measured response that cannot be explained by the PCA model. [Pg.365]
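The projection, reconstruction and residual steps described in the preceding excerpts can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the source's own equations (which are elided above): the data are assumed to be mean-centered, the loadings are taken from an SVD of the calibration data, the residual is written as actual minus estimated response, and the names `fit_pca`, `apply_pca`, `x_p`, `x_hat_p` and `e_p` are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the column means and the loadings P (n_vars x A) of a PCA model."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T          # orthonormal loading vectors as columns
    return mean, P

def apply_pca(x_new, mean, P):
    """Project a new response onto the model space and compute its residual."""
    xc = x_new - mean                # center with the calibration mean (assumption)
    t = xc @ P                       # scores: projection onto the loadings
    x_hat = t @ P.T + mean           # PCA-model-estimated response
    e = x_new - x_hat                # residual: part not explained by the model
    return t, x_hat, e

# Synthetic stand-in for previously collected analyzer responses
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(30, 6))
mean, P = fit_pca(X_cal, n_components=2)

x_p = rng.normal(size=6)             # a newly collected response
t_p, x_hat_p, e_p = apply_pca(x_p, mean, P)
```

Because the loading vectors are orthonormal, the residual `e_p` is orthogonal to the model space, which is what makes it "the portion of the measured response that cannot be explained by the PCA model".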

Conceptually, the T² value for a given sample reflects the extremeness of that sample's response within the PCA model space, whereas the Q value reflects the amount of the sample's response that is outside of the PCA model space. Therefore, both metrics are necessary to fully assess the abnormality of a response. In practice, before one can use a PCA model as a monitor, one must set a confidence limit on each of these metrics. There are several methods for determining these confidence limits [30,31], but these usually require two sets of information: (1) the set of T² and Q values that are obtained when the calibration data (or a suitable set of independent test data) is applied to the PCA model, and (2) a user-specified level of confidence (e.g. 95%, 99%, or 99.999%). Of course, the latter is totally at the discretion of the user, and is driven by the desired sensitivity and specificity of the monitoring application. [Pg.366]
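A minimal sketch of such a monitor is given below. It assumes the common definitions of the two metrics (Hotelling T² as the sum of squared scores scaled by the calibration score variances, and Q as the sum of squared residuals), since the source's equations are not reproduced here. The confidence limits are set as empirical percentiles of the calibration values, which is a simple stand-in and not one of the methods of refs [30,31]. The names `fit_monitor` and `t2_and_q` are hypothetical.

```python
import numpy as np

def t2_and_q(X, model):
    """Hotelling T^2 and Q for one response vector or a matrix of responses."""
    Xc = np.atleast_2d(X) - model["mean"]
    T = Xc @ model["P"]                              # scores
    E = Xc - T @ model["P"].T                        # residuals
    t2 = np.sum(T ** 2 / model["score_var"], axis=1)
    q = np.sum(E ** 2, axis=1)
    return t2, q

def fit_monitor(X_cal, n_components, confidence=0.95):
    """PCA model plus empirical (percentile-based) limits; an assumed limit method."""
    n = X_cal.shape[0]
    mean = X_cal.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    model = {
        "mean": mean,
        "P": Vt[:n_components].T,
        "score_var": s[:n_components] ** 2 / (n - 1),  # variance of each score
    }
    t2_cal, q_cal = t2_and_q(X_cal, model)
    model["t2_limit"] = np.quantile(t2_cal, confidence)
    model["q_limit"] = np.quantile(q_cal, confidence)
    return model

# Synthetic calibration data standing in for normal analyzer responses
rng = np.random.default_rng(1)
X_cal = rng.normal(size=(100, 8))
model = fit_monitor(X_cal, n_components=3, confidence=0.95)

# A strongly perturbed response, checked against both limits
x_odd = X_cal[0] + 25.0 * rng.normal(size=8)
t2, q = t2_and_q(x_odd, model)
is_abnormal = bool(t2[0] > model["t2_limit"] or q[0] > model["q_limit"])
```

Raising `confidence` widens both limits, trading sensitivity for specificity, which mirrors the user's choice described above.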

A SIMCA model is actually an assembly of J class-specific PCA models, each of which is built using only the calibration samples of a single class. At that point, confidence levels for the Hotelling T² and Q values (recall Equations 12.21 and 12.22) for each class can be determined independently. A SIMCA model is applied to an unknown sample by applying its analytical profile to each of the J PCA models, which leads to the generation of J sets of Hotelling T² and Q statistics for that sample. At this point, separate assessments of the unknown sample's membership in each class can be made, based on the T² and Q values for that sample, and the previously determined confidence levels. [Pg.396]
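The assembly of J class-specific models can be sketched as below. This is an illustrative simplification, not the referenced Equations 12.21 and 12.22: every class uses the same number of PCs, the limits are empirical percentiles of the calibration statistics, and a sample is accepted by a class only if both statistics fall within that class's limits. All function names are hypothetical.

```python
import numpy as np

def t2_and_q(X, model):
    """Hotelling T^2 and Q of the rows of X with respect to one class model."""
    Xc = np.atleast_2d(X) - model["mean"]
    T = Xc @ model["P"]
    E = Xc - T @ model["P"].T
    return np.sum(T ** 2 / model["score_var"], axis=1), np.sum(E ** 2, axis=1)

def fit_class_model(X, n_components, confidence):
    """PCA model of a single class with percentile-based limits (an assumption)."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    model = {"mean": mean, "P": Vt[:n_components].T,
             "score_var": s[:n_components] ** 2 / (X.shape[0] - 1)}
    t2, q = t2_and_q(X, model)
    model["t2_limit"] = np.quantile(t2, confidence)
    model["q_limit"] = np.quantile(q, confidence)
    return model

def simca_fit(X, y, n_components=2, confidence=0.95):
    """One PCA model per class, each built only from that class's samples."""
    return {c.item(): fit_class_model(X[y == c], n_components, confidence)
            for c in np.unique(y)}

def simca_predict(x, models):
    """Separate membership assessment for each class: none, one or several may hold."""
    result = {}
    for c, m in models.items():
        t2, q = t2_and_q(x, m)
        result[c] = bool(t2[0] <= m["t2_limit"] and q[0] <= m["q_limit"])
    return result

# Two synthetic, well-separated classes
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(40, 6)), rng.normal(size=(40, 6)) + 20.0])
y = np.repeat([0, 1], 40)
models = simca_fit(X, y)

# The centre of class 0 should be accepted by class 0 and rejected by class 1
membership = simca_predict(models[0]["mean"], models)
```

Returning one decision per class, rather than a single label, reflects the point that the membership assessments are made separately.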

Although the development of a SIMCA model can be rather cumbersome, because it involves the development and optimization of J PCA models, the SIMCA method has several distinct advantages over other classification methods. First, it can be more robust in cases where the different classes involve discretely different analytical responses, or where the class responses are not linearly separable. Second, the treatment of each class separately allows SIMCA to better handle cases where the within-class variance structure is... [Pg.396]

Figure 12.27 (A) Scatter plot of the Hotelling T² and Q residual statistics associated with the samples in the process spectroscopy calibration data set, obtained from a PCA model built on the data after obvious outliers were removed. The dashed lines represent the 95% confidence limit of the respective statistic. (B) The spectra used to generate the plot in (A), denoting one of the outlier samples.

The use of T² and Q prediction outlier metrics as described above is an example of a model-specific health monitor, in that the metrics refer to the specific analyzer response space that was used to develop a PLS, PCR or PCA prediction model. However, many PAT applications involve the deployment of multiple prediction models on a single analyzer. In such cases, one can also develop an analyzer-specific health monitor, where the T² and Q outlier metrics refer to a wider response space that covers all normal analyzer operation. This would typically be done by building a separate PCA model using a set of data that covers all normal analyzer responses. Of course, one could extend this concept further, and deploy multiple PCA health monitor models that are designed to detect different specific abnormal conditions. [Pg.431]

Figure 12.33 Time series plot of reduced T² and Q statistics associated with the application of an analyzer-specific PCA model to on-line analyzer data, covering a period of approximately 4 months.

Root Mean Square Error of Cross Validation for PCA Plot (Model Diagnostic): Figure 4.83 shows that the RMSECV PCA decreases significantly after the first and second PCs are added, but the decrease is much smaller when additional PCs are added. This implies that a two-component PCA model is appropriate. [Pg.89]

There is no indication of unusual or outlying sample(s) in the residuals plotted in Figure 4.29. This indicates that the PCA model accounts for the systematic variation in all of the samples in the data set. [Pg.230]

Root Mean Square Error of Cross Validation for PCA Plot (Model Diagnostic): As described above, the residuals from a standard PCA calculation indicate how the PCA model fits the samples that were used to construct the PCA model. Specifically, they are the portion of the sample vectors that is not described by the model. Cross-validation residuals are computed in a different manner. A subset of samples is removed from the data set and a PCA model is constructed. Then the residuals for the left-out samples are calculated (cross-validation residuals). The subset of samples is returned to the data set and the process is repeated for different subsets of samples until each sample has been excluded from the data set one time. These cross-validation residuals are the portion of the left-out sample vectors that is not described by the PCA model constructed from an independent sample set. In this sense they are like prediction residuals (vs. fit). [Pg.230]

One advantage of the cross-validation residuals is that they are more sensitive to outliers. Because the left-out samples do not influence the construction of the PCA models, unusual samples will have inflated residuals. The cross-validation PCA models are also less prone to modeling noise in the data and therefore the resulting residuals better reflect the inherent noise in the data set. The identification and removal of outliers and better estimation of noise can provide a more realistic estimate of the inherent dimensionality of a data set. [Pg.230]

In Equation 4.2, RMSECV PCA is a measure of the magnitude of the residuals for a PCA model using k principal components. The term residual_ij is the residual for the ith sample and jth measurement variable, nsamp is the number of samples, and nvars is the number of variables. [Pg.230]
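The cross-validation procedure and the RMSECV PCA measure can be sketched as follows. Equation 4.2 is not reproduced in this excerpt, so the sketch assumes the usual root-mean-square form, sqrt(sum over i and j of residual_ij² / (nsamp · nvars)). The function name `rmsecv_pca`, the contiguous fold layout, the mean-centering and the synthetic two-component data are all illustrative assumptions.

```python
import numpy as np

def rmsecv_pca(X, max_components, n_splits=5):
    """RMSECV for PCA models with k = 1..max_components principal components.

    Each sample is left out exactly once; its residual is the part of its
    (centered) vector not described by the PCA model built without it.
    """
    n_samp, n_vars = X.shape
    folds = np.array_split(np.arange(n_samp), n_splits)
    press = np.zeros(max_components)            # summed squared CV residuals per k
    for left_out in folds:
        keep = np.setdiff1d(np.arange(n_samp), left_out)
        mean = X[keep].mean(axis=0)
        _, _, Vt = np.linalg.svd(X[keep] - mean, full_matrices=False)
        Xc_out = X[left_out] - mean
        for k in range(1, max_components + 1):
            P = Vt[:k].T
            E = Xc_out - Xc_out @ P @ P.T       # cross-validation residuals
            press[k - 1] += np.sum(E ** 2)
    return np.sqrt(press / (n_samp * n_vars))

# Synthetic data with two underlying components plus a little noise
rng = np.random.default_rng(3)
scores = rng.normal(size=(40, 2)) * np.array([3.0, 2.0])
loadings = np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], dtype=float)
X = scores @ loadings + 0.05 * rng.normal(size=(40, 10))

rmsecv = rmsecv_pca(X, max_components=5)
```

With this simple row-wise scheme the curve cannot increase as PCs are added, so the number of components is read from where the decrease levels off, as in the two-component example discussed earlier on this page.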

To construct the multidimensional boxes, a training set of samples with known class identity is obtained. The training set is divided into separate sets, one for each class, and principal components are calculated separately for each of the classes. The number of relevant principal components (rank) is determined for each class and the SIMCA models are completed by defining boundary regions for each of the PCA models. [Pg.251]

The residuals are that portion of the sample measurement vector not fit by the PCA model for a given class. [Pg.266]
