Big Chemical Encyclopedia

LOO cross-validation

It is important to note that theoretical arguments and empirical studies have shown that the LOO cross-validation approach is preferable to the use of an external test set for small to moderately sized chemical databases [39]. The problems with holding out an external test set include (1) structural features of the held-out chemicals are not included in the modeling process, resulting in a loss of information, (2) predictions are made only on a subset of the available compounds, whereas LOO predicts the activity value for all compounds, and (3) personal bias can easily be introduced in the selection of the external test set. The reader is referred to Hawkins et al. [39] and Kraker et al. [40], in addition to Section 31.6, for further discussion of proper model validation techniques. [Pg.486]

The literature of the past three decades has witnessed a tremendous explosion in the use of computed descriptors in QSAR. It is noteworthy, however, that this has exacerbated another problem: rank deficiency, which occurs when the number of independent variables is larger than the number of observations. Stepwise regression and other similar approaches, which are popularly used when there is a rank deficiency, often result in overly optimistic and statistically incorrect predictive models. Such models would fail in predicting the properties of future, untested cases similar to those used to develop the model. It is essential that subset selection, if performed, be done within the model validation step as opposed to outside of it, thus providing an honest measure of the predictive ability of the model, i.e., the true q2 [39,40,68,69]. Unfortunately, many published QSAR studies involve subset selection followed by model validation, thus yielding a naive q2, which inflates the predictive ability of the model. The following steps outline the proper sequence of events for descriptor thinning and LOO cross-validation, e.g.,... [Pg.492]

Thus, it is still uncommon to test QSAR models (characterized by a reasonably high q2) for their ability to accurately predict the biological activities of compounds not included in the training set. In contrast to such expectations, it has been shown that, if a test set with known values of biological activities is available for prediction, there exists no correlation between the LOO cross-validated q2 and the correlation coefficient between the predicted and observed activities for the test set (Figure 16.1). In our experience [17, 28], this phenomenon is characteristic of many datasets and is independent of the descriptor types and optimization techniques used to develop training set models. In a recent review, we emphasized the importance of external validation in developing reliable models [18]. [Pg.440]

LOO cross-validation analysis of F-W QSAR models showed an overall agreement between predicted and experimental pIC50 for each individual combination of chemical series and protein target. [Pg.107]

The LOO cross-validated q2(LOO) values for the initial models were 0.875 using the water probe and 0.850 using the methyl probe. The application of the SRD/FFD variable selection improved the significance of both models: the analysis yielded a cross-validated q2(LOO) of 0.937 for the water probe and 0.923 for the methyl probe. In addition, we tested the reliability of the models by applying leave-20%-out and leave-50%-out cross-validation. Both models are also robust, as indicated by the high correlation coefficients of q2 = 0.910 (water probe, SDEP = 0.409) and 0.895 (methyl probe, SDEP = 0.440) obtained using the leave-50%-out cross-validation procedure. These statistical results gave confidence that the derived models could also be used for the prediction of novel compounds. [Pg.163]
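The leave-20%-out and leave-50%-out checks mentioned above amount to repeated random splits. The following Python sketch illustrates the idea, assuming scikit-learn and synthetic data in place of the study's GRID/CoMFA fields; the linear model, sample sizes, and split counts are illustrative choices, not the original setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit

def lgo_q2_sdep(X, y, frac_out=0.5, n_rounds=25, seed=0):
    """Leave-group-out CV: pool the holdout predictions over repeated random
    splits, then report q2 = 1 - PRESS/SS_tot and SDEP = sqrt(PRESS/N)."""
    obs, pred = [], []
    splitter = ShuffleSplit(n_splits=n_rounds, test_size=frac_out,
                            random_state=seed)
    for train, test in splitter.split(X):
        model = LinearRegression().fit(X[train], y[train])
        obs.append(y[test])
        pred.append(model.predict(X[test]))
    obs, pred = np.concatenate(obs), np.concatenate(pred)
    press = np.sum((obs - pred) ** 2)
    q2 = 1.0 - press / np.sum((obs - obs.mean()) ** 2)
    sdep = np.sqrt(press / len(obs))
    return q2, sdep

# Synthetic stand-in data: 40 compounds, 4 descriptors, a clean linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, -0.5, 0.8, 0.0]) + rng.normal(scale=0.2, size=40)
q2, sdep = lgo_q2_sdep(X, y, frac_out=0.5)  # leave-50%-out
```

Setting frac_out=0.2 gives the corresponding leave-20%-out statistics; a stable q2 across the different leave-out fractions is what the excerpt above cites as evidence of robustness.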

Independent of these problems, we consider a rigorous assessment of model quality a fundamental issue. Many of the published models were not validated with an (external) test set. Even then, internal validation with LOO cross-validation tends to overestimate the predictive power of a model [133]. On the other hand, leave-group-out (LGO) cross-validation may suffer from chance bias if only one random split is employed. External validation cannot be afforded in many cases because of the limited number of training samples available for model generation. [Pg.74]

One main strategy for testing the predictivity of a relationship involves leaving some of the observations out of the modeling process and using the model to predict their potencies. In the leave-one-out (LOO) cross-validation process, each compound is deleted once, a model is developed from the remaining compounds, and the potency of the left-out compound is predicted. The deviation of the observed potency from this predicted value is then used to calculate q2, analogous to r2. Other cross-validation strategies leave out more of the data on each round of calculation. [Pg.81]
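The LOO procedure just described can be sketched in a few lines of Python. This is a minimal illustration assuming scikit-learn and synthetic data; the linear regression is a stand-in for whatever QSAR model is being validated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def q2_loo(X, y):
    """LOO cross-validated q2 = 1 - PRESS/SS_tot, where PRESS sums the
    squared deviations of each observation from its holdout prediction."""
    preds = np.empty_like(y, dtype=float)
    for train, test in LeaveOneOut().split(X):
        # Delete one compound, refit, predict the left-out compound.
        model = LinearRegression().fit(X[train], y[train])
        preds[test] = model.predict(X[test])
    press = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot

# Synthetic stand-in data: 30 compounds, 3 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)
q2 = q2_loo(X, y)
```

The structure mirrors r2 exactly, except that each prediction comes from a model that never saw the compound being predicted, which is why q2 is the cross-validated analogue of r2.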

Carrying out the leave-one-out (LOO) cross-validation gives the following Q2 values...

Although LOO cross-validation is the most obvious choice, is it the best? Unfortunately, it is not. Figure 6.10 shows a simple two-dimensional... [Pg.134]

A correct LOO cross-validation can be done by moving the subset search inside the delete-and-predict loop. In other words, we take the sample of size n, remove one case, search for the best subset regression on the remaining n - 1 cases, and apply this subset regression to predict the holdout case. This is repeated for each of the n cases in turn, getting a true holdout prediction for each case. These holdout predictions are then used as a measure of the fit [38,39]. [Pg.162]
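A sketch of this correct procedure in Python, under stated assumptions: scikit-learn for the regressions, an exhaustive search over k-descriptor subsets standing in for the "best subset" step, and synthetic pure-noise data (where an honest q2 should come out near zero or negative, unlike the naive q2 of a subset chosen on the full sample):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def best_subset(X, y, k):
    """Pick the k-column subset with the highest in-sample R^2."""
    def r2(cols):
        Xs = X[:, list(cols)]
        return LinearRegression().fit(Xs, y).score(Xs, y)
    return max(combinations(range(X.shape[1]), k), key=r2)

def true_q2(X, y, k):
    """Correct LOO: the subset search is redone inside every
    delete-and-predict step, so each prediction is a true holdout."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)
        cols = list(best_subset(X[train], y[train], k))  # search on n - 1 cases
        fit = LinearRegression().fit(X[train][:, cols], y[train])
        preds[i] = fit.predict(X[i:i + 1, cols])[0]      # genuine holdout prediction
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

# Pure-noise data: y carries no information about X, so an honest q2
# should be near zero or negative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = rng.normal(size=20)
t = true_q2(X, y, k=2)
```

Running the subset search only once on all n cases and then cross-validating the chosen subset would let every holdout case influence its own descriptor selection, which is exactly the naive-q2 leak the excerpts above warn against.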

The use of SVM cannot solve all problems of noise in data processing, but the SVM technique can be used to improve the processing of noisy data in many ways. For example, it provides a way to delete outliers: by the leave-one-out (LOO) cross-validation method, we can delete the data samples with large prediction errors and thereby improve the data files. Besides, the adoption of the epsilon-insensitive loss function in support vector regression makes it more robust to noisy data sets. [Pg.6]
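One way to implement this outlier-deletion idea is sketched below, assuming scikit-learn's epsilon-SVR, a synthetic sine dataset with one planted outlier, and a two-standard-deviation cutoff on the LOO errors; the kernel, C, epsilon, and cutoff are all illustrative choices:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVR

def loo_abs_errors(X, y):
    """Absolute LOO prediction error for each sample under epsilon-SVR."""
    errors = np.empty(len(y))
    for train, test in LeaveOneOut().split(X):
        model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X[train], y[train])
        errors[test] = np.abs(y[test] - model.predict(X[test]))
    return errors

# Synthetic data: a noisy sine curve with one grossly wrong sample planted.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=40)
y[0] += 3.0  # planted outlier

res = loo_abs_errors(X, y)
keep = res < res.mean() + 2 * res.std()  # flag samples with large LOO error
X_clean, y_clean = X[keep], y[keep]
```

The planted outlier cannot be predicted from the remaining samples, so its LOO error is large and it is dropped, while ordinary noisy samples survive the cut.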

The result of data processing for the data set in Table 1.3 is quite similar. Although the rate of correctness in the training process increases very quickly (meaning that the structure of the data set is relatively simple and can be imitated very easily by an ANN), the minimum number of errors in the prediction test (by the LOO cross-validation method) for the ANN is still higher than that of the support vector machine, as shown in Fig. 1.4. [Pg.11]

Here the error denotes the averaged absolute error in the LOO cross-validation test. [Pg.70]

As mentioned above, three artificially generated data sets have been preprocessed by feature selection. Based on this selection, we have performed multitask learning using the partial least squares (PLS) method with LOO cross-validation on these three data sets. The values of the errors computed for the different inputs and outputs are listed in Table 4.10. [Pg.72]







© 2024 chempedia.info