Big Chemical Encyclopedia


Cross-validation problem

A crucial decision in PLS is the choice of the number of principal components used for the regression. A good approach to solve this problem is the application of cross-validation (see Section 4.4). [Pg.449]

The predictive power of the CPG neural network was tested with leave-one-out cross-validation. The overall percentage of correct classifications was low, with only 33% correct classifications, so it is clear that there are some major problems regarding the predictive power of this model. First of all, one has to remember that the data set is extremely small, with only 115 compounds, and has an extremely high number of classes, with nine different MOAs into which compounds have to be classified. The second task is to compare the cross-validated classifications of each MOA with the impression we already had from looking at the output layers. [Pg.511]
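Leave-one-out validation of a classifier can be sketched as below — a Python example assuming scikit-learn, with a k-nearest-neighbour classifier standing in for the CPG network and random data standing in for the MOA set; everything here is a hypothetical illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# hypothetical stand-in for the MOA data set: 115 compounds, 9 classes,
# 10 descriptors with no real class structure (pure noise)
X = rng.normal(size=(115, 10))
y = rng.integers(0, 9, size=115)

correct = 0
for train, test in LeaveOneOut().split(X):
    # refit on all compounds but one, predict the left-out compound
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
    correct += int(clf.predict(X[test])[0] == y[test][0])

rate = correct / len(y)
print(f"LOO correct classification rate: {rate:.1%}")
```

With nine classes and no real signal, the LOO rate sits near the 1/9 chance level, illustrating why a small multi-class set yields low cross-validated success rates.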

The most serious problem with ensemble average approaches is that they introduce many more parameters into the calculation, making the parameter-to-observable ratio worse. The effective number of parameters has to be restrained. This can be achieved by using only a few conformers in the ensemble and by determining the optimum number of conformers by cross-validation [83]. A more indirect way of restraining the effective number of parameters is to restrict the conformational space that the molecule can search... [Pg.269]

A similar problem arises with present cross-validated measures of fit [92], because they also are applied to the final clean list of restraints. Residual dipolar couplings offer an entirely different and, owing to their long-range nature, very powerful way of validating structures against experimental data [93]. Similar to cross-validation, a set of residual dipolar couplings can be excluded from the refinement, and the deviations from this set are evaluated in the refined structures. [Pg.271]

Cross-validation methods make use of the fact that it is possible to estimate missing measurements when the solution of an inverse problem is obtained. [Pg.414]

Generalized cross-validation. To overcome some problems with ordinary cross-validation, Golub et al. (1979) have proposed the generalized cross-validation (GCV), which is a weighted version of CV ... [Pg.415]
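For the ridge-regression setting treated by Golub et al. (1979), the GCV criterion can be computed directly from the influence matrix. The following NumPy sketch (synthetic data, illustrative names) minimises GCV over a grid of regularization parameters:

```python
import numpy as np

def gcv_score(X, y, lam):
    """GCV score for ridge regression with parameter lam:
    (1/n)||y - A y||^2 / [(1/n) tr(I - A)]^2, A = X (X'X + lam I)^-1 X'."""
    n, p = X.shape
    A = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - A @ y
    return (resid @ resid / n) / (np.trace(np.eye(n) - A) / n) ** 2

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))
beta = np.zeros(20)
beta[:3] = [2.0, -1.0, 1.5]          # only three informative coefficients
y = X @ beta + 0.5 * rng.normal(size=60)

lams = np.logspace(-3, 3, 25)
best_lam = lams[np.argmin([gcv_score(X, y, l) for l in lams])]
print("regularization parameter chosen by GCV:", best_lam)
```

Unlike ordinary CV, the score needs no refitting per left-out sample; the trace term plays the role of the averaged leverage.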

L. Ståhle and S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study. J. Chemometrics, 1 (1987) 185-196. [Pg.241]

It is important to note that theoretical argument and empirical study have shown that the LOO cross-validation approach is preferred to the use of an external test set for small to moderately sized chemical databases [39]. The problems with holding out an external test set include: (1) structural features of the held-out chemicals are not included in the modeling process, resulting in a loss of information; (2) predictions are made only on a subset of the available compounds, whereas LOO predicts the activity value for all compounds; and (3) personal bias can easily be introduced in the selection of the external test set. The reader is referred to Hawkins et al. [39] and Kraker et al. [40], in addition to Section 31.6, for further discussion of proper model validation techniques. [Pg.486]

The literature of the past three decades has witnessed a tremendous explosion in the use of computed descriptors in QSAR. But it is noteworthy that this has exacerbated another problem: rank deficiency. This occurs when the number of independent variables is larger than the number of observations. Stepwise regression and other similar approaches, which are popularly used when there is rank deficiency, often result in overly optimistic and statistically incorrect predictive models. Such models would fail in predicting the properties of future, untested cases similar to those used to develop the model. It is essential that subset selection, if performed, be done within the model validation step rather than outside of it, thus providing an honest measure of the predictive ability of the model, i.e., the true q2 [39,40,68,69]. Unfortunately, many published QSAR studies involve subset selection followed by model validation, thus yielding a naive q2, which inflates the predictive ability of the model. The following steps outline the proper sequence of events for descriptor thinning and LOO cross-validation, e.g.,... [Pg.492]
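The gap between a naive and a true q2 can be demonstrated numerically — a Python sketch assuming scikit-learn, with pure-noise data and a simple univariate-filter selection standing in for descriptor thinning (illustrative choices, not those of the cited studies):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

def loo_q2(model, X, y):
    """LOO cross-validated q2 = 1 - PRESS / total sum of squares."""
    press = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error").sum()
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
# pure-noise data: 30 compounds, 200 descriptors, no structure-activity signal
X = rng.normal(size=(30, 200))
y = rng.normal(size=30)

# naive q2: descriptor thinning on the full data set, validation afterwards
keep = SelectKBest(f_regression, k=5).fit(X, y).get_support()
naive_q2 = loo_q2(LinearRegression(), X[:, keep], y)

# true q2: thinning repeated inside every cross-validation fold
pipe = make_pipeline(SelectKBest(f_regression, k=5), LinearRegression())
true_q2 = loo_q2(pipe, X, y)

print(f"naive q2 = {naive_q2:.2f}, true q2 = {true_q2:.2f}")
```

On data with no real signal, the naive q2 comes out deceptively positive because the selection step has already seen every compound, while the true q2 (selection inside each fold) correctly reports no predictive ability.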

Cross-validation is an alternative to the split-sample method of estimating prediction accuracy (5). Molinaro et al. describe and evaluate many variants of cross-validation and bootstrap re-sampling for classification problems where the number of candidate predictors vastly exceeds the number of cases (13). The cross-validated prediction error is an estimate of the prediction error associated with application of the algorithm for model building to the entire dataset. [Pg.334]

Often the number of samples for calibration is limited and it is not possible to split the data into a calibration set and a validation set that are each representative enough for their purpose. As we want a satisfactory model that predicts future samples well, we should include as many different samples in the calibration set as possible. This leads to the severe problem that we have no samples left for the validation set. Such a problem could be solved if we were able to perform both calibration and validation with the whole set of samples (without predicting the same samples that we have used to calculate the model). There are different options but, roughly speaking, most of them can be classified under the generic term "cross-validation". More advanced discussions can be found elsewhere [31-33]. [Pg.205]
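The idea of reusing every sample for both roles can be made concrete with a k-fold split — a short Python sketch assuming scikit-learn, with an arbitrary 20-sample set:

```python
import numpy as np
from sklearn.model_selection import KFold

# with only 20 calibration samples, splitting off a separate validation set
# would leave too few samples for either task; 4-fold cross-validation lets
# every sample serve for calibration and, exactly once, for validation
X = np.arange(20, dtype=float).reshape(-1, 1)
held_out = []
for i, (train, test) in enumerate(KFold(n_splits=4).split(X)):
    held_out.extend(test.tolist())
    print(f"segment {i}: calibrate on {len(train)} samples, "
          f"validate on {len(test)}")
```

Across the four segments, every sample is predicted once by a model that never saw it, which is exactly the property the excerpt asks for.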

Problem 4.13 Determining the Number of Significant Components in a Dataset by Cross-validation... [Pg.268]

Perform standardised cross-validated PLS on the data (note that the property parameter should be mean centred but there is no need to standardise), calculating five PLS components. If you are not using the Excel Add-in the following steps may be required, as described in more detail in Problem 5.8 ... [Pg.323]

Although cross-validation is always performed on the preprocessed data, the RSS and PRESS values are always calculated on the x block in the original units, as discussed in Chapter 4, Section 4.33.2. The reason for this relates to rather complex problems that occur when standardising a column after one sample has been removed. There are, of course, many other possible approaches. When performing cross-validation, the only output available involves error analysis. [Pg.452]
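The point that PRESS is computed in the original units even though the model is fitted on preprocessed data can be illustrated with a simplified one-component PCA sketch in plain NumPy (a hypothetical example; the reconstruction step is deliberately minimal):

```python
import numpy as np

rng = np.random.default_rng(4)
# data far from the origin, so standardisation genuinely matters
X = rng.normal(loc=50.0, scale=5.0, size=(25, 8))

press = 0.0
for i in range(len(X)):
    train = np.delete(X, i, axis=0)
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    Z = (train - mean) / std              # standardise the training block only
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    z_left = (X[i] - mean) / std
    # rank-1 reconstruction, then back-transform to the original units
    x_hat = (z_left @ Vt[0]) * Vt[0] * std + mean
    press += ((X[i] - x_hat) ** 2).sum()

print(f"PRESS for one component, in the original units: {press:.1f}")
```

The standardisation parameters are recomputed for each reduced training set, and the squared residuals are accumulated only after the prediction is mapped back, so PRESS stays comparable to an RSS computed on the raw x block.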





