Prediction SIMCA

Often the goal of a data analysis problem requires more than simple classification of samples into known categories. It is very often desirable to have a means to detect outliers and to derive an estimate of the level of confidence in a classification result. These are things that go beyond strictly nonparametric pattern recognition procedures. Also of interest is the ability to empirically model each category so that it is possible to make quantitative correlations and predictions with external continuous properties. As a result, a modeling and classification method called SIMCA has been developed to provide these capabilities (29-31). [Pg.425]

If care is not taken about the way s is obtained, SIMCA has a tendency to exclude more objects from the training class than necessary. The s-value should be determined by cross-validation. Each object in the training set is then predicted using the A-dimensional PCA model obtained for the other (n - 1) training set objects. The (residual) scores obtained in this way for each object are used in eq. (33.14) [30]. [Pg.230]
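The cross-validated residual estimate described above can be sketched as follows. This is a minimal illustration, not the source's eq. (33.14): the function names, the synthetic data, and the plain root-mean-square normalisation (no degrees-of-freedom correction) are all assumptions made for the example.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit a PCA model by SVD; return the column means and the loadings."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T  # loadings have shape (n_variables, A)


def squared_residual(x, mean, loadings):
    """Squared distance between one object and its projection on the model."""
    centered = x - mean
    residual = centered - loadings @ (loadings.T @ centered)
    return float(residual @ residual)


def fitted_residual_sd(X, n_components):
    """Residual spread when every object helps to build the model."""
    mean, loadings = fit_pca(X, n_components)
    total = sum(squared_residual(x, mean, loadings) for x in X)
    return float(np.sqrt(total / X.size))  # assumption: plain RMS, no dof correction


def cross_validated_residual_sd(X, n_components):
    """Residual spread when each object is predicted by the other n - 1."""
    total = 0.0
    for i in range(X.shape[0]):
        mean, loadings = fit_pca(np.delete(X, i, axis=0), n_components)
        total += squared_residual(X[i], mean, loadings)
    return float(np.sqrt(total / X.size))


rng = np.random.default_rng(0)
X = rng.normal(size=(15, 6))  # synthetic training class, for illustration only
s_fit = fitted_residual_sd(X, 2)
s_cv = cross_validated_residual_sd(X, 2)
```

Because each left-out object did not influence the model that predicts it, its residual cannot be smaller than in the model fitted on all objects. The cross-validated value is therefore the larger of the two, which gives a wider class boundary and fewer wrongly excluded training objects.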

In SIMCA, we can determine the modelling power of the variables, i.e. we measure the importance of the variables in modelling the class. Moreover, it is possible to determine the discriminating power, i.e. which variables are important to discriminate two classes. The variables with both low discriminating and modelling power are deleted. This is more a variable elimination procedure than a selection procedure: we do not try to select the minimum number of features that will lead to the best classification (or prediction rate), but rather eliminate those that carry no information at all. [Pg.237]
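As a rough illustration of modelling power, the sketch below uses one commonly cited form, 1 - (residual standard deviation of a variable) / (standard deviation of that variable before modelling). This form, the function name, and the synthetic data are assumptions for the example; degrees-of-freedom corrections are omitted and constant variables are not handled.

```python
import numpy as np


def modelling_power(X, n_components):
    """Per-variable modelling power of a class PCA model (assumed form)."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_components].T
    residuals = centered - centered @ loadings @ loadings.T
    # Close to 1: the variable is well described by the model.
    # Close to 0: the model explains almost none of its variation.
    return 1.0 - residuals.std(axis=0) / centered.std(axis=0)


rng = np.random.default_rng(1)
t = rng.normal(size=50)  # one latent factor drives the first two variables
X = np.column_stack([
    t,
    2.0 * t + 0.01 * rng.normal(size=50),
    0.1 * rng.normal(size=50),  # unrelated noise variable
])
power = modelling_power(X, n_components=1)
```

Here the first two variables score close to 1 and the noise variable close to 0, so only the latter would be a candidate for elimination, provided its discriminating power is also low.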

The main classification methods for drug development are discriminant analysis (DA), possibly based on principal components (PLS-DA), and soft independent models for class analogy (SIMCA). SIMCA is based only on PCA analysis: one PCA model is created for each class, and distances between objects and the projection space of PCA models are evaluated. PLS-DA is, for example, applied for the prediction of adverse effects by nonsteroidal anti-... [Pg.63]
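The idea of one PCA model per class, with distances measured from an object to each model's projection space, can be sketched as follows. The two synthetic classes, the single-component models, and the function names are assumptions made for illustration.

```python
import numpy as np


def fit_class_model(X, n_components):
    """PCA model of one class: column means plus loadings."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def orthogonal_distance(x, model):
    """Distance from an object to the projection space of a class model."""
    mean, loadings = model
    centered = x - mean
    return float(np.linalg.norm(centered - loadings @ (loadings.T @ centered)))


rng = np.random.default_rng(0)
noise = lambda: 0.05 * rng.normal(size=(30, 3))
# Class A spreads along the first axis; class B along the second, shifted in the third.
class_a = rng.normal(size=(30, 1)) * np.array([1.0, 0.0, 0.0]) + noise()
class_b = np.array([0.0, 0.0, 5.0]) + rng.normal(size=(30, 1)) * np.array([0.0, 1.0, 0.0]) + noise()

models = {"A": fit_class_model(class_a, 1), "B": fit_class_model(class_b, 1)}
new_object = np.array([2.0, 0.0, 0.0])
distances = {name: orthogonal_distance(new_object, m) for name, m in models.items()}
```

The new object lies close to the class A model and far from the class B model. Turning such distances into accept or reject decisions requires a class boundary, which this sketch does not define.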

Wolohan P.R.N., Clark R.D. Predicting drug pharmacokinetic properties using molecular interaction fields and SIMCA. Journal of Computer-Aided Molecular Design, 2003, 17 (1), 65-76. [Pg.72]

Another feature of SIMCA that is of considerable utility lies in the assistance the technique provides in selecting relevant variables. Information contained in the residuals, ei -, can be used to select variables relevant to the classification objective. If the residuals for a variable are not well predicted by the model, the standard deviation is large. A quantity called modeling power has been defined to express this relationship quantitatively. The modeling power (MPOW) is defined as ... [Pg.206]

In the discussion that follows, the SIMCA method is illustrated by applying it to three problems: (1) quality assurance of chromatography data, (2) classification of unknowns, and (3) predicting the composition of unknown samples. This third problem is one of deconvolution of a mixture and calculation of the relative concentration of the constituents (25, 38). [Pg.210]

SIMCA uses PCA to model the shape and position of the object formed by the samples in row space for class definition. A multidimensional box is constructed for each class and the classification of future samples (prediction) is performed by determining within which box, if any, the sample lies. This is in contrast to KNN, where only the physical closeness of samples in space is used for classification. [Pg.72]

Steps 1 and 2 are discussed in detail in Sections 4.2.1 and 4.2.2 (PCA and HCA). In step 3, the training set is divided into calibration and test sets to facilitate the estimation of the SIMCA models. Typically, we leave more than half of the data in the calibration set. It is also good practice to repeat the calibration procedure in Table 4.16 with different selections of calibration and test sets. An alternative to separate test sets is to implement some form of cross-validation. For example, leave-one-out cross-validation can be performed, where each sample is left out and predicted one at a time. [Pg.75]
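The two validation schemes mentioned here, a calibration/test split and leave-one-out cross-validation, can be sketched with plain index bookkeeping. The function names and the 70% calibration fraction are assumptions; the text only asks for more than half of the data in the calibration set.

```python
import numpy as np


def calibration_test_split(n_samples, calibration_fraction=0.7, seed=None):
    """Randomly split sample indices into calibration and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_cal = int(round(calibration_fraction * n_samples))
    return np.sort(order[:n_cal]), np.sort(order[n_cal:])


def leave_one_out(n_samples):
    """Yield (calibration indices, left-out index) for each sample in turn."""
    indices = np.arange(n_samples)
    for i in indices:
        yield np.delete(indices, i), int(i)


# Repeat the split with different seeds to check that conclusions are stable.
splits = [calibration_test_split(20, seed=s) for s in range(3)]
loo_folds = list(leave_one_out(4))
```

Each split or fold supplies the indices used to build the class model and the indices to predict, so the model-building code can stay the same for both schemes.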

Predict the class of the test set samples using the initial SIMCA models. [Pg.75]

There are many results to be reviewed because there are multiple classes for which SIMCA models are constructed and validated. The order in which to examine the results is a matter of preference, and many approaches are equally appropriate. We will review one SIMCA model at a time, and examine the test set predictions for that one model against samples from all classes. Ideal performance of a SIMCA model means that it includes as part of the class those samples that truly belong to the class and excludes those samples that are from all of the other classes. In reality, a number of classification scenarios are possible. Table 4.18 lists the possibilities along with possible root causes for misclassified test samples. [Pg.80]

All of the class C validation samples are predicted to be outside of the class B SIMCA model. The minimum values are on the order of 100 and the maximum value is over 3000. Based on these results, it does not appear that there is any overlap of classes C and B. [Pg.82]

The final step of constructing the SIMCA models is to merge the calibration and test samples for each of the classes and reconstruct new SIMCA models using all of the data. The rank and boundary parameters determined in Habit 4 are used for the final models. These models are used to predict the class(es) of unknown samples. Table 4.24 contains the values for three unknown samples where the empirically determined critical value is 1.6. From the values, the conclusions are that unknown 1 is not a member of any class in the training set, unknown 2 is a member of class B, and unknown 3 is a member of both classes A and B. [Pg.85]
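The decision rule implied here, comparing each class statistic with a critical value, can return no class, one class, or several. The sketch below assumes that membership means a statistic below the critical value. The numbers are hypothetical and are not taken from Table 4.24; only the critical value of 1.6 comes from the text.

```python
def assign_classes(statistics, critical_value):
    """Return every class whose statistic falls below the critical value."""
    return sorted(name for name, value in statistics.items() if value < critical_value)


CRITICAL_VALUE = 1.6  # empirically determined value quoted in the text

# Hypothetical statistics for three unknowns (illustration only).
unknowns = {
    "unknown 1": {"A": 310.0, "B": 95.0, "C": 870.0},
    "unknown 2": {"A": 5400.0, "B": 0.8, "C": 42.0},
    "unknown 3": {"A": 1.1, "B": 0.6, "C": 12.0},
}
memberships = {name: assign_classes(stats, CRITICAL_VALUE) for name, stats in unknowns.items()}
```

With these made-up numbers the three outcomes mirror the pattern described in the text: no class, class B only, and both A and B.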

F_calc has already validated the predictions to some extent. A very large F_calc value (e.g., unknown 2 on class A has an F_calc of 11,000) indicates a very large difference between the unknown sample and the calibration samples from that class. Unknown 1 has large F_calc values for all of the classes, with most of the contribution coming from the PCA measurement residual. Unknown 2 is within the box of all SIMCA models but has large PCA contributions with classes A and C. Unknown 3 is excluded only from class C, primarily because of a large contribution from the distance term (the expected value is zero). [Pg.85]

To test the models, the training set is divided into calibration and validation sets, as shown in Table 4.26. The predictive ability of the TEA and MEK SIMCA models is then evaluated using samples from all 10 classes. [Pg.90]

Table 4.28 displays and for the three TEA test samples as a function of the rank of the TEA SIMCA model. For a rank of one or two, the validation samples are predicted to be in the class. When the rank is three,... [Pg.91]

Examining F while changing rank confirms a rank one SIMCA model for MEK. Using rank one, the three validation samples are all predicted to be in the class. Using rank two or three, all validation samples are predicted to be outside of the MEK class (see Table 4.30). [Pg.92]

The results of predictions using the two SIMCA models on four unknown samples are shown in Table 4.31. These preliminary results indicate that unknowns 1 and 4 are not members of either class, unknown 2 is MEK, and unknown 3 is TEA. [Pg.93]

Scores Plot. The score plot is used to examine the location of the samples in the PC space. A three-dimensional PCA scores plot for the TEA model is shown in Figure 4.93. (Keep in mind that only the first two PCs were used to construct the TEA SIMCA model.) The TEA training set samples and the four unknowns are shown on this plot. Unknown 3, which was predicted to be... [Pg.93]

Supervised pattern recognition methods are used for predicting the class of unknown samples given a training set of samples with known class membership. Two methods are discussed in Section 4.3, KNN and SIMCA,... [Pg.95]

Habits 5 and 6 are not described because PCA is not used in this section as a predictive tool. The supervised pattern-recognition technique, SIMCA, uses PCA for class prediction, and the details of Habits 5 and 6 for SIMCA are presented in Section 4.3.2.1. [Pg.233]

Once the class boundaries are defined, it is important to determine whether any of the classes in the training set overlap. This indicates the discriminating power of the SIMCA models and will impact the confidence that can be placed on future predictions. There are various algorithmic measures of class overlap and the reader is referred to their software package documentation for details. In this chapter, class overlap is indicated when training set samples are predicted to be members of multiple classes. This is demonstrated in a two-dimensional example shown in Figure 4.59. Two classes are shown where class A is described by one principal component and class B is described by two principal components. The overlap of the classes is indicated because unknown Z is classified as belonging to both classes. [Pg.252]
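The overlap check used in this chapter, training samples predicted to belong to more than one class, amounts to a simple tally once each class model has accepted or rejected every training sample. The sketch below assumes those decisions are already available as a boolean table; the function name and the small example table are hypothetical.

```python
import numpy as np


def overlap_counts(accepted, true_class, n_classes):
    """counts[a, b]: number of class-a training samples accepted by the class-b model."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for row, a in zip(accepted, true_class):
        counts[a] += row  # boolean row adds 1 for every accepting model
    return counts


# Hypothetical decisions: rows are training samples, columns are class models (A, B).
accepted = np.array([
    [True, False],
    [True, True],   # a class A sample also accepted by the class B model
    [False, True],
    [False, True],
])
true_class = [0, 0, 1, 1]  # 0 = class A, 1 = class B
counts = overlap_counts(accepted, true_class, n_classes=2)
```

Non-zero off-diagonal entries flag overlapping classes. Here one class A sample falls inside the class B model, while no class B sample falls inside the class A model.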

What If 1. Changing Class Volume. Consider using the class C samples to construct a SIMCA model and using the class A validation samples as a test set. Remember that class C is three-dimensional and A is one-dimensional and one end of the line lies close to class C. The prediction results using the default class size have been discussed and the F values are plotted in Figure... [Pg.262]

The reason for these misclassifications is that the training sample residuals for the rank two SIMCA model are larger than those for the rank three model. Therefore, prediction samples can have a larger residual and still be considered to be members of class C. [Pg.263]
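The effect of rank on training residuals can be checked numerically: for a PCA model, the residual sum of squares equals the sum of the squared singular values that are left out, so a lower rank can only leave the residuals equal or larger. The sketch below uses synthetic data and a hypothetical function name; the plain root-mean-square normalisation is also an assumption.

```python
import numpy as np


def training_residual_rms(X, rank):
    """RMS residual of the training samples for a PCA model of the given rank."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    discarded = singular_values[rank:]  # components not included in the model
    return float(np.sqrt(np.sum(discarded ** 2) / centered.size))


rng = np.random.default_rng(2)
X = rng.normal(size=(25, 5))  # synthetic class, for illustration only
rms_rank_two = training_residual_rms(X, 2)
rms_rank_three = training_residual_rms(X, 3)
```

Since the class boundary is scaled by the training residuals, the larger rank two residuals translate into a more permissive boundary, which is the mechanism behind the misclassifications described above.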

Summary of Prediction Diagnostic Tools for SIMCA. From the prediction diagnostics, the conclusion is that unknowns 1 and 4 do not belong to either the TEA or the MEK class. Sample 3 is a member of the TEA class and sample 2 is a member of the MEK class. There is considerable reliability in the classifications due to the large values for the excluded samples both in the validation and prediction phases. The residuals and score plots are consistent with the values. [Pg.273]
