Big Chemical Encyclopedia


Regression cross-validation

M. Stone and R.J. Brooks, Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares, and principal component regression. J. Roy. Stat. Soc. B52 (1990) 237-269. [Pg.347]

M. Stone and R.J. Brooks, Continuum Regression: Cross-validated Sequentially-constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares, and Principal Component Regression, J. Roy. Stat. Soc. B52 (1990) 237-269. [Pg.229]

A crucial decision in PLS is the choice of the number of principal components used for the regression. A good approach to solve this problem is the application of cross-validation (see Section 4.4). [Pg.449]
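A minimal sketch of this selection procedure on hypothetical data, using principal component regression as a stand-in (it makes the component count explicit; the same leave-one-out loop applies unchanged to PLS):

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_comp):
    """Principal component regression: regress y on the first
    n_comp principal components of the centered training X."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_comp] * s[:n_comp]                  # training scores
    b = np.linalg.lstsq(T, y_train - y_mean, rcond=None)[0]
    T_test = (X_test - x_mean) @ Vt[:n_comp].T      # project test samples
    return T_test @ b + y_mean

def loo_rmspe(X, y, n_comp):
    """Leave-one-out prediction error at a given model complexity."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        pred = pcr_fit_predict(X[mask], y[mask], X[i:i+1], n_comp)
        errs.append((pred[0] - y[i]) ** 2)
    return np.sqrt(np.mean(errs))

# Synthetic example: y depends on 2 latent directions plus noise,
# so the cross-validated error should level off near 2 components.
rng = np.random.default_rng(0)
T_true = rng.normal(size=(30, 2))
X = T_true @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(30, 8))
y = T_true @ np.array([1.0, -2.0]) + 0.05 * rng.normal(size=30)

rmspe = {a: loo_rmspe(X, y, a) for a in range(1, 6)}
best = min(rmspe, key=rmspe.get)
```

The chosen complexity is the one minimizing the leave-one-out error, exactly the criterion plotted in figures such as Fig. 36.10 below.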

The second task discussed is the validation of regression models with the aid of cross-validation (CV) procedures. Both the leave-one-out (LOO) and leave-many-out CV methods are used to evaluate the predictive capabilities of QSAR models. For noisy and/or heterogeneous data, the LM method is shown to outperform the LS method substantially with respect to the suitability of the regression models built. The distinctions between the LS and LM methods are especially noticeable under the LOO CV criterion. [Pg.22]
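The LOO and leave-many-out procedures differ only in how the folds are formed; a small illustration on synthetic data, with ordinary least squares as the model under validation (all data and names hypothetical):

```python
import numpy as np

def cv_q2(X, y, fit, predict, folds):
    """Cross-validated q2 = 1 - PRESS/TSS for an arbitrary list of
    test-index arrays (folds); works for LOO and leave-many-out alike."""
    press = 0.0
    for test_idx in folds:
        train = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train], y[train])
        press += np.sum((predict(model, X[test_idx]) - y[test_idx]) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def fit_ols(X, y):
    A = np.column_stack([np.ones(len(y)), X])       # intercept column
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + 0.2 * rng.normal(size=40)

loo_folds = [np.array([i]) for i in range(40)]      # leave-one-out
lmo_folds = np.array_split(rng.permutation(40), 5)  # leave-many-out, 5 groups
q2_loo = cv_q2(X, y, fit_ols, predict_ols, loo_folds)
q2_lmo = cv_q2(X, y, fit_ols, predict_ols, lmo_folds)
```

For well-behaved data the two criteria agree closely; they diverge mainly for noisy or heterogeneous data sets, which is where the LS/LM contrast in the excerpt above becomes visible.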

Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained from leave-one-out cross-validation using PCR (o) and PLS ( ) regression.
Figure 18 Regression equation obtained from partial-least-squares data cross-validated in Figure 17.
The literature of the past three decades has witnessed a tremendous explosion in the use of computed descriptors in QSAR. It is noteworthy, however, that this has exacerbated another problem: rank deficiency, which occurs when the number of independent variables is larger than the number of observations. Stepwise regression and other similar approaches, popularly used when there is a rank deficiency, often result in overly optimistic and statistically incorrect predictive models. Such models fail in predicting the properties of future, untested cases similar to those used to develop the model. It is essential that subset selection, if performed, be done within the model validation step rather than outside of it, thus providing an honest measure of the predictive ability of the model, i.e., the true q2 [39,40,68,69]. Unfortunately, many published QSAR studies involve subset selection followed by model validation, thus yielding a naive q2, which inflates the predictive ability of the model. The following steps outline the proper sequence of events for descriptor thinning and LOO cross-validation, e.g.,... [Pg.492]
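The gap between a naive q2 (descriptors selected once, before CV) and a true q2 (selection repeated inside every CV fold) can be demonstrated on pure noise, where any apparent predictivity is selection bias. A sketch on hypothetical data, with correlation-based top-k selection standing in for stepwise regression:

```python
import numpy as np

def select_top_k(X, y, k):
    """Indices of the k descriptors most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(r)[-k:]

def loo_q2(X, y, k, select_inside):
    """LOO q2 for OLS on k selected descriptors.
    select_inside=True  -> honest: re-select within every training fold.
    select_inside=False -> naive: select once on all data (information leak)."""
    n = len(y)
    if not select_inside:
        cols = select_top_k(X, y, k)
    press = 0.0
    for i in range(n):
        train = np.arange(n) != i
        c = select_top_k(X[train], y[train], k) if select_inside else cols
        A = np.column_stack([np.ones(train.sum()), X[train][:, c]])
        coef = np.linalg.lstsq(A, y[train], rcond=None)[0]
        pred = np.concatenate([[1.0], X[i, c]]) @ coef
        press += (pred - y[i]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Pure noise: there is nothing to predict, so an honest q2 should be poor.
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 200))    # rank-deficient: 200 descriptors, 30 samples
y = rng.normal(size=30)

q2_naive = loo_q2(X, y, k=5, select_inside=False)  # selection outside CV
q2_true = loo_q2(X, y, k=5, select_inside=True)    # selection inside CV
```

On such data the naive q2 exceeds the true q2, which is the inflation the excerpt warns about.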

The second possibility is called cross-validation. The test samples are measured by a reference method. However, the reference method cannot provide true values, because measurement error occurs here as well. Nevertheless, well-characterized methods can provide generally accepted values, which are then compared to the ones obtained using the test calibration. Note that this comparison must be done using particularly suited regression methods, because the errors of both methods will be of the same order of magnitude. Cross-validation of CE against HPLC, in particular, has been frequently reported. [Pg.239]

Acoustic spectra were calibrated using PLS regression with six ammonia concentration levels. The reference concentration levels were 0, 0.5, 1, 2, 5 and 8% ammonia, with five replicate measurements at each level. Figure 9.23 shows the PLS-R prediction results validated with two-segment cross-validation (commonly referred to as a test set switch). [Pg.299]
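A sketch of the two-segment (test set switch) scheme on simulated data mimicking this setup, with ordinary least squares standing in for PLS-R for brevity (all values hypothetical):

```python
import numpy as np

def two_segment_rmspe(X, y, seg_a, seg_b):
    """Test-set switch: calibrate on one segment, predict the other,
    swap the roles, then pool the prediction errors."""
    preds = np.empty_like(y)
    for train, test in ((seg_a, seg_b), (seg_b, seg_a)):
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef = np.linalg.lstsq(A, y[train], rcond=None)[0]
        preds[test] = np.column_stack([np.ones(len(test)), X[test]]) @ coef
    return np.sqrt(np.mean((preds - y) ** 2))

# Simulated calibration: 6 concentration levels, 5 replicates each,
# 10 spectral "channels" proportional to concentration plus noise.
rng = np.random.default_rng(2)
levels = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 8.0])
y = np.repeat(levels, 5)                      # 30 reference concentrations
s = rng.normal(size=10)                       # pure-component "spectrum"
X = np.outer(y, s) + 0.05 * rng.normal(size=(30, 10))

idx = np.arange(30)
seg_a = idx[idx % 2 == 0]                     # alternating replicates, so
seg_b = idx[idx % 2 == 1]                     # both segments span all levels
rmspe = two_segment_rmspe(X, y, seg_a, seg_b)
```

Assigning alternate replicates to the two segments matters: each segment must cover the full concentration range, or the swapped calibrations would extrapolate.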

Like ANNs, SVMs can be useful in cases where the x-y relationships are highly nonlinear and poorly understood. There are several tuning parameters that need to be optimized, including the severity of the "cost penalty", the threshold fit error, and the nature of the nonlinear kernel. However, if one takes care to optimize these parameters by cross-validation (Section 12.4.3) or similar methods, the susceptibility to overfitting is not as great as for ANNs. Furthermore, the deployment of SVMs is simpler than for other nonlinear modeling alternatives (such as local regression, ANNs, and nonlinear variants of PLS) because the model can be expressed completely in terms of a relatively low number of support vectors. More details regarding SVMs can be obtained from several references [70-74]. [Pg.389]
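The cross-validated grid search used to set such parameters follows one generic pattern; here a ridge penalty stands in for the SVM cost/kernel parameters, since the selection loop is identical (hypothetical data):

```python
import numpy as np

def kfold_rmspe(X, y, n_folds, lam):
    """K-fold CV error of ridge regression at penalty lam
    (stand-in for one point of an SVM cost/epsilon/kernel grid)."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), n_folds)
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        Xt, yt = X[train], y[train]
        coef = np.linalg.solve(Xt.T @ Xt + lam * np.eye(X.shape[1]), Xt.T @ yt)
        press += np.sum((X[test] @ coef - y[test]) ** 2)
    return np.sqrt(press / len(y))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + 0.3 * rng.normal(size=60)

grid = [10.0 ** p for p in range(-3, 4)]       # candidate penalty values
cv_err = {lam: kfold_rmspe(X, y, 5, lam) for lam in grid}
best_lam = min(cv_err, key=cv_err.get)         # parameter with lowest CV error
```

Tuning on cross-validated error rather than training fit is what keeps the chosen parameters from simply memorizing the calibration set.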

Each of the regression models is evaluated for prediction ability, typically using cross-validation. [Pg.424]

The optimal number of components from the prediction point of view can be determined by cross-validation (10). This method compares the predictive power of several models and chooses the optimal one. In our case, the models differ in the number of components. The predictive power is calculated by a leave-one-out technique, so that each sample is predicted once from a model in whose calculation it did not participate. This technique can also be used to determine the number of underlying factors in the predictor matrix, although if the factors are highly correlated, their number will be underestimated. In contrast to the least-squares solution, PLS can estimate the regression coefficients even for underdetermined systems. In this case, it introduces some bias in exchange for the (infinite) variance of the least-squares solution. [Pg.275]
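PLS's ability to handle underdetermined systems can be illustrated with a bare-bones PLS1 (NIPALS) sketch on simulated data with more variables than samples (all values hypothetical):

```python
import numpy as np

def pls1(X, y, n_comp):
    """PLS1 (NIPALS) regression vector; defined even when X has more
    columns than rows, where plain least squares is underdetermined."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)        # weight vector
        t = Xc @ w                    # score vector
        tt = t @ t
        p = Xc.T @ t / tt             # X loading
        qa = yc @ t / tt              # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - qa * t              # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)   # coefficients in centered X-space

rng = np.random.default_rng(5)
X = rng.normal(size=(15, 40))         # 15 samples, 40 variables: underdetermined
beta = np.zeros(40); beta[:3] = [1.0, -1.0, 0.5]
y = X @ beta + 0.05 * rng.normal(size=15)

b = pls1(X, y, n_comp=3)
pred = (X - X.mean(axis=0)) @ b + y.mean()
resid = np.sqrt(np.mean((pred - y) ** 2))
```

Ordinary least squares has no unique solution here (40 unknowns, 15 equations); PLS trades that infinite variance for a small bias, as the excerpt notes.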

For partial least-squares (PLS) or principal component regression (PCR), the infrared spectra were transferred to a DEC VAX 11/750 computer via the NIC-COM software package from Nicolet. This package also provided utility routines used to put the spectra into files compatible with the PLS and PCR software. The PLS and PCR program with cross-validation was provided by David Haaland of Sandia National Laboratory. A detailed description of the program and the procedures used in it has been given (5). [Pg.47]

The pioneers of bioavailability modeling can be traced back to the year 2000. Andrews and coworkers [57] developed a regression model to predict bioavailability for 591 compounds. Compared with Lipinski's "Rule of Five," the false negative predictions were reduced from 5% to 3%, while the false positive predictions decreased from 78% to 53%. The model achieved a relatively good correlation (r2 = 0.71) for the training set, but when 80/20 cross-validation was applied, the correlation decreased to q2 = 0.58. [Pg.114]


See other pages where Regression cross-validation is mentioned: [Pg.16]    [Pg.774]    [Pg.182]    [Pg.16]    [Pg.774]    [Pg.182]    [Pg.491]    [Pg.497]    [Pg.511]    [Pg.717]    [Pg.717]    [Pg.483]    [Pg.330]    [Pg.342]    [Pg.369]    [Pg.379]    [Pg.416]    [Pg.40]    [Pg.298]    [Pg.117]    [Pg.113]    [Pg.165]    [Pg.166]    [Pg.539]    [Pg.276]    [Pg.134]    [Pg.106]    [Pg.110]    [Pg.114]    [Pg.123]    [Pg.232]    [Pg.274]    [Pg.177]   
See also in source #XX -- [Pg.154]







Cross validated

Cross validation

Cross-validated multiple regression

Principal Component Regression cross validation

Regression cross model validation

Regression validation
