Validity of a PLS model

Depending on the number of components, PLS analyses often yield perfect correlations because of the usually large number of X variables included. The quality of fit is therefore no criterion for the validity of a PLS model. A cross-validation procedure (Figure 34) [26, 409, 611] must be used to select the model having the highest predictive ability. In cross-validation many PLS runs are... [Pg.102]
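The contrast between fit and predictive ability can be illustrated with a small sketch. It is not taken from the cited text: the NIPALS-style single-response PLS routine, the function names pls1_fit and pls1_predict, and the synthetic data are all assumptions made for illustration. The residual sum of squares of the fit (RSS) can only go down as components are added, whereas the leave-one-out PRESS is the quantity used to pick the model.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS style); return (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # centred working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)              # weight vector
        t = Xr @ w                             # scores
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        c = (yr @ t) / tt                      # y loading
        Xr = Xr - np.outer(t, p)               # deflate X
        yr = yr - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # coefficients in original X space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return y_mean + (X - x_mean) @ coef

# Hypothetical data: 30 samples, 40 collinear X variables driven by 2 latent factors.
rng = np.random.default_rng(0)
n, p, max_components = 30, 40, 10
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = latent @ np.array([1.0, -0.5]) + 0.5 * rng.normal(size=n)

rss = np.zeros(max_components)     # error of fit, all samples used for modelling
press = np.zeros(max_components)   # leave-one-out prediction error
for a in range(1, max_components + 1):
    rss[a - 1] = np.sum((y - pls1_predict(pls1_fit(X, y, a), X)) ** 2)
    for i in range(n):
        keep = np.arange(n) != i               # leave sample i out
        model = pls1_fit(X[keep], y[keep], a)
        press[a - 1] += (y[i] - pls1_predict(model, X[i:i + 1])[0]) ** 2

best = int(np.argmin(press)) + 1   # number of components with the lowest PRESS
print("RSS  :", np.round(rss, 2))
print("PRESS:", np.round(press, 2))
print("components selected by cross-validation:", best)
```

Leave-one-out is used here only because it is the simplest scheme to write down; leaving out groups of samples follows the same pattern.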

Fig. 12. A PRESS plot for a leverage validation of a PLS model built from a training set of NIR diffuse reflectance spectra of 50 samples of wheat. With this data set, no minimum is reached within the 20 factors calculated.

Fig. 15. A plot of the spectral residual versus sample number for a cross-validation of a PLS model of research octane number built from 57 NIR spectra of gasoline. Notice that sample No. 45 has a significantly different residual than the remainder of the set, indicating that it is a possible outlier.

S. Gourvenec, J. A. Fernandez-Pierna, D. L. Massart and D. N. Rutledge, An evaluation of the PoLiSh smoothed regression and the Monte Carlo cross-validation for the determination of the complexity of a PLS model, Chemom. Intell. Lab. Syst., 68, 2003, 41-51. [Pg.238]

Scheme I Schematic representation of (a) PLS model (b) UVE-PLS model and (c) matrix B, containing regression coefficients calculated by the leave-one-out cross-validation procedure and their mean values, standard deviations, and stability.

Critical validation of the PLS models is essential. Explained variance (R²X) and goodness of fit (R²Y) are important parameters but not sufficient. Goodness of prediction (root mean square error [RMSE] or Q²) obtained after cross-validation [22] is essential to avoid overfit and correlations that are simply due to chance. A final validation should be performed by distinguishing between a training set and a test set. The training set is used to create a PLS model, which is subsequently used to predict values obtained from an independent test set. Furthermore, repeatability and reproducibility should be evaluated using repeated analyses and parallel samples. [Pg.757]
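As a minimal sketch of how goodness of fit and goodness of prediction differ, the snippet below applies the usual definition 1 - (residual sum of squares)/(total sum of squares) twice: once to fitted values and once to cross-validated predictions. The definition is the commonly used one and is assumed here rather than taken from the cited text; the function name and all numbers are hypothetical.

```python
def explained_fraction(y_obs, y_pred):
    """1 - SS_residual / SS_total around the mean of the observed values."""
    mean = sum(y_obs) / len(y_obs)
    ss_total = sum((y - mean) ** 2 for y in y_obs)
    ss_resid = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))
    return 1.0 - ss_resid / ss_total

# Hypothetical values for five samples.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fitted = [1.1, 1.9, 3.0, 4.1, 4.9]   # predictions for samples used in the fit
y_cv = [1.4, 1.6, 3.3, 4.5, 4.4]       # predictions made while each sample was left out

r2_fit = explained_fraction(y_obs, y_fitted)   # goodness of fit
q2 = explained_fraction(y_obs, y_cv)           # goodness of prediction
print(f"goodness of fit = {r2_fit:.3f}, goodness of prediction = {q2:.3f}")
```

A large gap between the two values is the usual warning sign of an overfitted model.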

Figure 18.2 Representative receiver operator curves to demonstrate the leave n out validation of K-PLS classification models (metabolite formed or not formed) derived with approximately 300 molecules and over 60 descriptors. The diagonal line represents random. The horizontal axis represents the percentage of false positives and the vertical axis the percentage of false negatives in each case. a. N-dealkylation. b. O-dealkylation. c. Aromatic hydroxylation. d. Aliphatic hydroxylation. e. O-glucuronidation. f. O-sulfation. Data generated in collaboration with Dr. Mark Embrechts (Rensselaer Polytechnic Institute).

Unfortunately, the ANN method is probably the most susceptible to overfitting of the methods discussed thus far. For similar N and M, ANNs require many more parameters to be estimated in order to define the model. In addition, cross validation can be very time-consuming, as models with varying complexity (number of hidden nodes) must be trained individually before testing. Also, the execution of an ANN model is considerably more elaborate than a simple dot product, as it is for MLR, CLS, PCR and PLS (Equations 12.34, 12.37, 12.43 and 12.46). Finally, there is very little, or no, interpretive value in the parameters of an ANN model, which eliminates one useful means for improving the confidence of a predictive model. [Pg.388]

Summary of Validation Diagnostic Tools for PLS/PCR. Example 1 Eleven samples were used to build a PLS model to predict the concentration of component A with varying amounts of components B and C. The details of the data set used to construct the calibration model are as follows ... [Pg.157]

The optimal complexity of the PLS model, that is, the most appropriate number of latent variables, is determined by evaluating, with a proper validation strategy (see Section VI.F), the prediction error corresponding to models with increasing complexity. The parameter considered is usually the standard deviation of the error of calibration (SDEC), if computed with the objects used for building the model, or the standard deviation of the error of prediction (SDEP), if computed with objects not used for building the model (see Section VI.F). [Pg.95]
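The two parameters share one formula and differ only in which objects supply the residuals. The sketch below assumes the common definition, the square root of the mean squared residual, which the excerpt itself does not spell out; the function name and the numbers are hypothetical.

```python
import math

def sd_of_error(y_obs, y_pred):
    """Square root of the mean squared residual (assumed definition of SDEC/SDEP)."""
    n = len(y_obs)
    return math.sqrt(sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred)) / n)

# Objects used for building the model (hypothetical values) give SDEC.
y_cal_obs, y_cal_pred = [10.0, 12.0, 14.0, 16.0], [10.5, 11.5, 14.5, 15.5]
# Objects not used for building the model (hypothetical values) give SDEP.
y_new_obs, y_new_pred = [11.0, 15.0], [12.0, 14.0]

sdec = sd_of_error(y_cal_obs, y_cal_pred)
sdep = sd_of_error(y_new_obs, y_new_pred)
print(f"SDEC = {sdec:.2f}, SDEP = {sdep:.2f}")
```

In practice the SDEP would be recomputed for each candidate number of latent variables and the complexity with the lowest value retained.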

We can summarise some other ideas for evaluating the predictive ability of the PLS model. First, you can compare the average error (RMSEP) with the concentration levels of the standards (in calibration) and evaluate whether you (or your client) can accept the magnitude of this error (fit-for-purpose). Then, it is interesting to calculate the so-called "ratio of prediction to deviation", which is just RPD = SD/SEP, where SD is the standard deviation of the concentrations of the validation samples and SEP is the bias-corrected standard error of prediction (see Section 4.6 for more details on SEP). As a rule of thumb, an RPD lower than 3 suggests the model has poor predictive capabilities [54]. [Pg.222]
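A short sketch of the RPD calculation follows. The bias-corrected SEP is computed here with the commonly used formula (residuals centred on their mean, divided by n - 1), which is an assumption because Section 4.6 is not reproduced in this excerpt; the function name and the validation values are hypothetical.

```python
import math
import statistics

def rpd(y_ref, y_pred):
    """Ratio of prediction to deviation, RPD = SD / SEP, for a validation set."""
    n = len(y_ref)
    errors = [yp - yr for yr, yp in zip(y_ref, y_pred)]
    bias = sum(errors) / n
    # Bias-corrected standard error of prediction (assumed formula).
    sep = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))
    sd = statistics.stdev(y_ref)   # sample SD of the reference concentrations
    return sd / sep

# Hypothetical validation samples: reference and predicted concentrations.
y_ref = [2.0, 4.0, 6.0, 8.0, 10.0]
y_pred = [2.5, 4.0, 6.5, 8.0, 10.5]

ratio = rpd(y_ref, y_pred)
print(f"RPD = {ratio:.2f}", "(acceptable)" if ratio >= 3 else "(poor predictive ability)")
```

Because SD depends on the concentration range of the validation samples, an RPD is only comparable between models validated on similar sample sets.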

Another potential disadvantage of PLS over PCR is that there is a higher potential to overfit the model through the use of too many PLS factors, especially if the Y-data are rather noisy. In such situations, one could easily run into a situation where the addition of a PLS factor helps to explain noise in the Y-data, thus improving the model fit without an improvement in real predictive ability. As for all other quantitative regression methods, the use of validation techniques is critical to avoid model overfitting (see Section 8.3.7). [Pg.263]

Two non-parametric methods for hypothesis testing with PCA and PLS are cross-validation and the jackknife estimate of variance. Both methods are described in some detail in the sections describing the PCA and PLS algorithms. Cross-validation is used to assess the predictive property of a PCA or a PLS model. The distribution function of the cross-validation test-statistic cvd-sd under the null-hypothesis is not well known. However, for PLS, the distribution of cvd-sd has been empirically determined by computer simulation technique [24] for some particular types of experimental designs. In particular, the discriminant analysis (or ANOVA-like) PLS analysis has been investigated in some detail as well as the situation with Y one-dimensional. This simulation study is referred to for detailed information. However, some tables of the critical values of cvd-sd at the 5 % level are given in Appendix C. [Pg.312]

The PLS pseudo-coefficient profile of the third component of the PLS model highlights the descriptors that have the greatest importance in the chemometric model. The most important 3D-pharmacophoric descriptors in the PLS model suggest a common pharmacophore for all the substrates. The activity increases strongly in molecules with a high value of the descriptors 33-23, 11-33, 13-8, 14-41, 44-43. The descriptors are explained in detail in Table 9.1. The most important descriptors in the PLS model can be arranged to obtain an approximate pharmacophore valid for molecules actively transported by Pgp. The pharmacophore consists of two H-bond acceptor groups, two hydrophobic areas and the size of the molecule, which plays a major role in the interaction (Fig. 9.3). [Pg.202]

The PLS model becomes identical to the MLR model when the number of latent variables in the PLS model equals the number of original independent variables, something that rarely happens because model validation usually selects fewer components. The regression coefficients of the MLR model are straightforward to interpret, whereas the PLS latent variables need to be transformed back into the original variable space to be interpreted in a similar manner. This also means that the PLS regression coefficients are dimension-dependent, that is, they depend on how many latent variables (PLS components) are used. However, since each successive PLS component explains a decreasing amount of variance, it usually matters little whether a PLS model is based on three or four components, which also means that the PLS regression coefficients will not differ very much between the three- and four-component models. [Pg.1012]
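The equivalence with MLR at full rank can be checked numerically. The sketch below is an illustration under stated assumptions, not the cited authors' code: it uses a NIPALS-style single-response PLS written for this example (pls1_coefficients is a hypothetical name), synthetic data with four X variables, and NumPy's least-squares solver as the MLR reference.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """PLS regression vectors (original variable space) for 1..n_components, centred data."""
    Xr, yr = X.copy(), y.copy()
    W, P, q, coefs = [], [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)          # weight vector
        t = Xr @ w                         # scores
        tt = t @ t
        p = Xr.T @ t / tt                  # X loadings
        c = (yr @ t) / tt                  # y loading
        Xr = Xr - np.outer(t, p)           # deflate X
        yr = yr - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
        Wm, Pm = np.array(W).T, np.array(P).T
        # Transform the latent-variable model back to coefficients on the X variables.
        coefs.append(Wm @ np.linalg.solve(Pm.T @ Wm, np.array(q)))
    return coefs

# Hypothetical data: 25 samples, 4 independent variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.2 * rng.normal(size=25)
Xc, yc = X - X.mean(axis=0), y - y.mean()

b_mlr = np.linalg.lstsq(Xc, yc, rcond=None)[0]       # MLR coefficients
b_pls = pls1_coefficients(Xc, yc, n_components=4)    # PLS coefficients for 1 to 4 components
for a, b in enumerate(b_pls, start=1):
    gap = float(np.max(np.abs(b - b_mlr)))
    print(f"{a} component(s): coefficients {np.round(b, 3)}, max |PLS - MLR| = {gap:.2e}")
```

The printed coefficient vectors also show the dimension dependence directly: each number of components gives its own set of regression coefficients.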

After elimination of uninformative variables, a PLS model for the y and Xnew data is constructed, using the leave-one-out cross validation procedure to estimate its complexity. [Pg.331]

Using five scales (i.e. a total of 2^(5+1) - 1 = 63 variables) the prediction error was 2.8% on the validation set. This is a significant reduction (93%) in the model complexity from using 882 variables. To get an indication of how the complexity of the PLS model changes with resolution level, an SEC surface was generated, see Fig. 26. Here it can be seen that the optimal model (scale = 5, i.e. a total of 63 variables, using A = 7 PLS factors) is as expected located in the direction of the lower left corner of the SEC surface. For the prediction... [Pg.400]
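The variable count and the quoted reduction can be checked with a few lines of arithmetic. Reading the count as 2^(scales + 1) - 1 is an assumption that matches the stated total of 63; the variable names are hypothetical.

```python
n_scales = 5
n_selected = 2 ** (n_scales + 1) - 1       # assumed relation between scales and variables
n_original = 882                           # number of variables before compression

reduction = 1 - n_selected / n_original    # fraction of variables removed
print(f"{n_selected} variables, reduction = {100 * reduction:.1f}%")
```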
