Cross validated

An inspection of the cross-validation results revealed that all but one of the compounds in the dataset had been modeled reasonably well. The remaining (31st) compound behaved anomalously. When we looked at its chemical structure, we saw that it was the only compound in the dataset that contained a fluorine atom. What would happen if we removed this compound from the dataset? The quality of learning improved substantially: the cross-validation coefficient increased from 0.82 to 0.92, while the error decreased from 0.65 to 0.44. Another learning method, the Kohonen Self-Organizing Map, also failed to classify this 31st compound correctly. Hence, we had to conclude that the fluorine-containing compound was an obvious outlier in the dataset. [Pg.206]
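As an illustration of this kind of diagnosis, here is a minimal sketch (not the authors' code; the synthetic data, linear model, and scikit-learn calls are all assumptions) that flags the compound with the largest leave-one-out residual as a candidate outlier:

```python
# Minimal sketch: flag a possible outlier from leave-one-out CV residuals.
# Data, model, and the "31st compound" perturbation are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(5)
X = rng.normal(size=(31, 6))                  # 31 compounds, 6 descriptors
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=31)
y[30] += 3.0                                  # make the 31st compound anomalous

resid = y - cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
worst = np.argmax(np.abs(resid))
print(f"compound {worst + 1} has the largest LOO residual: {resid[worst]:.2f}")
```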

Another method for detecting overfitting/overtraining is cross-validation. Here, test sets are compiled at run time: some predefined number, n, of the compounds is removed, the rest are used to build a model, and the objects that were removed serve as a test set. Usually the procedure is repeated several times; the number of iterations, m, is also predefined. The most popular values for n and m are, respectively, 1 and N, where N is the number of objects in the primary dataset. This is called leave-one-out cross-validation. [Pg.223]
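The procedure maps directly onto code. Below is a minimal sketch assuming scikit-learn and synthetic placeholder data: with n = 1 and m = N it reproduces leave-one-out, while with n > 1 and a predefined m it gives the general leave-n-out scheme.

```python
# Minimal sketch of the procedure above; estimator and data are placeholders.
import numpy as np
from sklearn.model_selection import LeaveOneOut, ShuffleSplit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))          # 30 "compounds", 5 descriptors
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=30)

def cross_validate(splitter, X, y):
    """Remove the test objects, fit on the rest, predict the removed ones."""
    errors = []
    for train_idx, test_idx in splitter.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        errors.extend(y[test_idx] - model.predict(X[test_idx]))
    return np.sqrt(np.mean(np.square(errors)))   # RMS prediction error

# n = 1, m = N: leave-one-out cross-validation
print(cross_validate(LeaveOneOut(), X, y))
# n > 1, m predefined: leave-n-out (here n = 5 objects out, m = 20 repeats)
print(cross_validate(ShuffleSplit(n_splits=20, test_size=5, random_state=0), X, y))
```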

Our recommendation is that one should use leave-n-out cross-validation rather than leave-one-out. Nevertheless, there is a possibility that test sets derived in this way will be incompatible with the training sets with respect to information content, i.e., the test sets could well lie outside the modeling space [8]. [Pg.223]

A crucial decision in PLS is the choice of the number of principal components used for the regression. A good approach to this problem is the application of cross-validation (see Section 4.4). [Pg.449]

The predictive power of the CPG neural network was tested with leave-one-out cross-validation. The overall percentage of correct classifications was low, at only 33%, so it is clear that there are major problems with the predictive power of this model. First of all, one has to remember that the data set is extremely small, with only 115 compounds, and has an extremely high number of classes: nine different MOAs into which the compounds have to be classified. The second task is to compare the cross-validated classifications of each MOA with the impression we already had from looking at the output layers. [Pg.511]

The maximum number of latent variables is the smaller of the number of x values or the number of molecules. However, there is an optimum number of latent variables in the model beyond which the predictive ability of the model does not increase. A number of methods have been proposed for deciding how many latent variables to use. One approach is cross-validation, which involves adding successive latent variables; both leave-one-out and the group-based methods can be applied. As the number of latent variables increases, the cross-validated r² will first increase and then either reach a plateau or even decrease. Another parameter that can be used to choose the appropriate number of latent variables is the standard deviation of the error of the predictions, s_PRESS ... [Pg.725]
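To make this selection procedure concrete, here is a minimal sketch (assuming scikit-learn; the data set and candidate range are illustrative) that computes the cross-validated r² (q²) and PRESS for successive numbers of latent variables:

```python
# Minimal sketch: choose the number of PLS latent variables by cross-validation.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))                 # 60 molecules, 20 descriptors
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=60)

cv = KFold(n_splits=7, shuffle=True, random_state=1)    # a group-based scheme
for a in range(1, 11):                                  # candidate latent variables
    y_pred = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv).ravel()
    press = np.sum((y - y_pred) ** 2)                   # predictive residual sum of squares
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)      # cross-validated r^2
    print(f"{a:2d} latent variables: q2 = {q2:.3f}, PRESS = {press:.3f}")
# Pick the smallest number of latent variables at which q2 levels off.
```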

The second task discussed is the validation of the regression models with the aid of cross-validation (CV) procedures. The leave-one-out (LOO) as well as the leave-many-out CV methods are used to evaluate the predictive capabilities of QSAR. In the case of noisy and/or heterogeneous data, the LM method is shown to outperform the LS one substantially with respect to the suitability of the regression models built. Especially noticeable distinctions between the LS and LM methods are demonstrated with the use of the LOO CV criterion. [Pg.22]

The most serious problem with ensemble average approaches is that they introduce many more parameters into the calculation, making the parameter-to-observable ratio worse. The effective number of parameters has to be restrained. This can be achieved by using only a few conformers in the ensemble and by determining the optimum number of conformers by cross-validation [83]. A more indirect way of restraining the effective number of parameters is to restrict the conformational space that the molecule can search... [Pg.269]

A similar problem arises with present cross-validated measures of fit [92], because they also are applied to the final clean list of restraints. Residual dipolar couplings offer an entirely different and, owing to their long-range nature, very powerful way of validating structures against experimental data [93]. Similar to cross-validation, a set of residual dipolar couplings can be excluded from the refinement, and the deviations from this set are evaluated in the refined structures. [Pg.271]

SJ Cho, A Tropsha. Cross-validated R2-guided region selection for comparative molecular field analysis: a simple method to achieve consistent results. J Med Chem 38:1060-1066, 1995. [Pg.367]

During the selection of the number of hidden layer neurons, the desired tolerance should also be considered. In general, a tight tolerance requires that the selected network be trained with fewer hidden neurons. As mentioned earlier, cross-validation during training can be used to monitor the error progression, which subsequently serves as a guideline in the selection of the hidden layer neurons. [Pg.10]
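One simple way to put this guideline into practice is to score candidate hidden-layer sizes by their cross-validated error; the sketch below assumes scikit-learn tooling and synthetic data, and is not a prescription from the text:

```python
# Minimal sketch: use cross-validation to guide the hidden layer size.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

for n_hidden in (2, 4, 8, 16, 32):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                       random_state=2)
    score = cross_val_score(net, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{n_hidden:2d} hidden neurons: CV MSE = {-score.mean():.4f}")
# Select the smallest hidden layer whose CV error meets the desired tolerance.
```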

Sometimes it is just not feasible to assemble any validation samples. In such cases there are still other tests, such as cross-validation, which can help us do a certain amount of validation of a calibration. However, these tests provide neither the level of information nor the level of confidence that we should have before placing a calibration into service. More about this later. [Pg.23]

Indicator functions have the advantage that they can be used on data sets for which no concentration values (y-data) are available. But cross-validation and, especially, PRESS can often provide more reliable guidance. [Pg.103]

Cross-validation. We don't always have a sufficient set of independent validation samples with which to calculate PRESS. In such instances, we can use the original training set to simulate a validation set. This approach is called cross-validation. The most common form of cross-validation is performed as follows ... [Pg.107]

This procedure is known as "leave one out" cross-validation. It is not the only way to do cross-validation: we could apply the same approach by leaving out any number of samples from the training set, in all permutations; the only constraint is the size of the training set itself. Nonetheless, whenever the term cross-validation is used, it almost always refers to "leave one out" cross-validation. [Pg.108]
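A hand-rolled version of the "leave one out" procedure makes the mechanics explicit: each sample in turn is held out, a calibration is built from the remaining samples, and the held-out sample is predicted. The least-squares calibration and the synthetic data below are illustrative assumptions:

```python
# Minimal hand-rolled "leave one out" cross-validation accumulating PRESS.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.05, size=25)

press = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i                 # leave sample i out
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    y_hat = X[i] @ coef                           # predict the left-out sample
    press += (y[i] - y_hat) ** 2                  # accumulate squared residual
print(f"leave-one-out PRESS: {press:.4f}")
```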

If we did not have a validation set available to us, we could use cross-validation for the same purposes. Figure 55 contains plots of the results of cross-validation of the two training sets, A1 and A2. Since no separate validation data set is involved, we name the results PCRCROSS1 and PCRCROSS2, respectively. [Pg.115]

Figure 55. Logarithmic plots of the cross-validation results as a function of the number of factors (rank) used to construct the calibration.
So, cross-validation and PRESS both indicate that we should use 5 factors for our calibrations. This indication is sufficiently consistent with the F-test on the REVs and with our "eyeball" inspection of the EVs and REVs themselves. It can also be worthwhile to look at the eigenvectors themselves. [Pg.117]

Many people use the term PRESS to refer to the result of leave-one-out cross-validation. This usage is especially common among statisticians, and for this reason the terms PRESS and cross-validation are sometimes used interchangeably. However, there is nothing innate in the definition of PRESS that restricts it to a particular set of predictions. As a result, many in the chemometrics community use the term PRESS more generally, applying it to predictions other than just those produced during cross-validation. [Pg.168]

In this book, the term PRESS is used only for the case where the calibration was generated with one data set and the predictions were made on an independent data set. The term CROSS is used to denote the PRESS computed during cross-validation. This was done in an attempt to distinguish cross-validation from other means of validation. [Pg.168]
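For reference, whichever prediction set it is applied to, PRESS (the predicted residual error sum of squares) has the conventional definition

PRESS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

where the y_i are the reference values and the \hat{y}_i are the corresponding predictions; CROSS, in this book's terminology, is simply this sum taken over the cross-validation predictions.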

Cross-validation of PCR calibration from A1/C1 for A3
Cross-validation of PCR calibration from A2/C2 for A3 [Pg.198]


Cross-validation methods make use of the fact that, once the solution of an inverse problem has been obtained, it is possible to estimate missing measurements. [Pg.414]
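One concrete instance of this idea (an illustration, not from the text) is generalized cross-validation, which picks the regularization parameter of a linear inverse problem according to how well the regularized solution predicts left-out measurements. A minimal numpy sketch for Tikhonov (ridge) regularization, with synthetic placeholder data:

```python
# Minimal sketch: generalized cross-validation (GCV) for a ridge-regularized
# linear inverse problem y = X b + noise.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.3, size=40)
n = len(y)

def gcv(lam):
    """GCV(lam) = n * ||(I - H) y||^2 / tr(I - H)^2, with H the ridge hat matrix."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - H @ y
    return n * (resid @ resid) / (n - np.trace(H)) ** 2

lambdas = np.logspace(-4, 2, 25)
best = min(lambdas, key=gcv)
print(f"GCV-selected regularization parameter: {best:.4g}")
```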


See other pages where Cross validated is mentioned: [Pg.106]    [Pg.107]    [Pg.107]    [Pg.491]    [Pg.497]    [Pg.511]    [Pg.16]    [Pg.697]    [Pg.717]    [Pg.717]    [Pg.717]    [Pg.360]    [Pg.103]    [Pg.113]    [Pg.117]    [Pg.119]    [Pg.145]    [Pg.147]    [Pg.403]    [Pg.412]    [Pg.414]   







Adopted cross-validation technique

Chemometrics cross validation

Cross validation

Cross validation Subject

Cross validation analysis

Cross validation error

Cross validation genetic algorithm

Cross validation protocols

Cross validation segmented

Cross validation time-consuming

Cross validation, spline

Cross-Validation and Bootstrapping

Cross-cultural validation

Cross-validated correlation coefficient

Cross-validated error rate

Cross-validated multiple regression

Cross-validated prediction error

Cross-validated r-squared

Cross-validated residuals

Cross-validation definition

Cross-validation distance prediction

Cross-validation limitations

Cross-validation method

Cross-validation principal components

Cross-validation problem

Cross-validation procedures

Cross-validation procedures technique

Cross-validation purposes

Cross-validation robustness

Cross-validation technique

Cross-validation test

Cross-validation, description

Full Cross-Validation

Generalized cross-validation

Internal cross-validation method

Interviewee Response Validation - Cross-Checking

K-Fold Cross-Validation

LOO cross-validation

Leave-one-out cross validation method

Leave-one-out cross-validation (LOOCV)

Leave-one-out, cross validation

Monte Carlo cross-validation methods

Multiple cross-validation

PLS Prediction Cross Validation

PRESS value, cross-validation

Partial cross-validation

Partial least squares cross-validation

Partial least squares models cross-validation

Principal Component Regression cross validation

Principal components analysis cross-validation

Protein structure (cross) validation

QSAR (quantitative structure-activity relationships) cross-validation

Regression cross model validation

Regression cross-validation

Root mean square error cross validation

Root-mean-square error of cross validation

Root-mean-square error of cross validation (RMSECV)

SECV, Standard error of cross validation

Standard error of cross validation

Test-set and cross-validation

Validation cross-model
