
Out-crossing

Another method of detecting overfitting/overtraining is cross-validation. Here, test sets are compiled at run-time, i.e., some predefined number, n, of the compounds is removed, the rest are used to build a model, and the objects that have been removed serve as a test set. Usually, the procedure is repeated several times; the number of iterations, m, is also predefined. The most popular values for n and m are, respectively, 1 and N, where N is the number of objects in the primary dataset. This is called leave-one-out cross-validation. [Pg.223]
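The procedure can be sketched in a few lines of code. The following is a minimal illustration, not taken from the source; the `fit` and `predict` callables are placeholders for whatever modeling method is in use.

```python
import numpy as np

def loo_cv(X, y, fit, predict):
    """Hold each object out once; return the out-of-sample predictions."""
    N = len(y)
    preds = np.empty(N)
    for i in range(N):
        mask = np.arange(N) != i           # every object except the i-th
        model = fit(X[mask], y[mask])      # build the model on N - 1 objects
        preds[i] = predict(model, X[i])    # predict the removed object
    return preds

# Illustration with ordinary least squares standing in for the model:
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda coef, x: x @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
print(loo_cv(X, y, fit, predict))
```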

Our recommendation is to use leave-n-out cross-validation rather than leave-one-out. Nevertheless, there is a possibility that test sets derived in this way would be incompatible with the training sets with respect to information content, i.e., the test sets could well be outside the modeling space [8]. [Pg.223]
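For the recommended leave-n-out variant, the test sets can simply be drawn at random in each of the m rounds. A minimal sketch, with illustrative names only:

```python
import numpy as np

def leave_n_out_splits(N, n, m, seed=0):
    """Yield (train_idx, test_idx) for m rounds, each holding out n random objects."""
    rng = np.random.default_rng(seed)
    for _ in range(m):
        perm = rng.permutation(N)
        yield perm[n:], perm[:n]   # train on N - n objects, test on n

for train, test in leave_n_out_splits(N=10, n=3, m=4):
    print("train:", sorted(train), "test:", sorted(test))
```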

The predictive power of the CPG neural network was tested with leave-one-out cross-validation. The overall percentage of correct classifications was low, with only 33% correct classifications, so it is clear that there are some major problems regarding the predictive power of this model. First of all, one has to remember that the data set is extremely small, with only 115 compounds, and has an extremely high number of classes, with nine different MOAs into which compounds have to be classified. The second task is to compare the cross-validated classifications of each MOA with the impression we already had from looking at the output layers. [Pg.511]

This procedure is known as "leave one out" cross-validation. It is not the only way to do cross-validation: we could apply the same approach by leaving out all combinations of any number of samples from the training set. The only constraint is the size of the training set itself. Nonetheless, whenever the term cross-validation is used, it almost always refers to "leave one out" cross-validation. [Pg.108]
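A sketch of this more general "leave-p-out" scheme, enumerating every possible held-out subset of p samples (function names are hypothetical):

```python
from itertools import combinations

def leave_p_out_splits(N, p):
    """Yield (train, test) index tuples for every possible held-out set of p samples."""
    all_idx = set(range(N))
    for test in combinations(range(N), p):
        yield tuple(sorted(all_idx - set(test))), test

# With p = 1 this reduces to ordinary leave-one-out cross-validation.
for train, test in leave_p_out_splits(N=4, p=2):
    print("train:", train, "test:", test)
```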

Many people use the term PRESS to refer to the result of leave-one-out cross-validation. This usage is especially common among statisticians, and for this reason the terms PRESS and cross-validation are sometimes used interchangeably. However, there is nothing innate in the definition of PRESS that restricts it to a particular set of predictions. As a result, many in the chemometrics community use the term PRESS more generally, applying it to predictions other than just those produced during cross-validation. [Pg.168]
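The general definition is simply a sum of squared prediction errors, regardless of where the predictions come from. A minimal illustration, not tied to any particular package:

```python
import numpy as np

def press(y_true, y_pred):
    """PRESS: sum over samples of (observed - predicted)^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sum((y_true - y_pred) ** 2))

print(press([1.0, 2.0, 3.0], [1.1, 1.8, 3.3]))  # 0.14
```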

When applied to QSAR studies, the activity of molecule u is calculated simply as the average activity of the K nearest neighbors of molecule u. An optimal K value is selected either by optimizing the classification of a test set of samples or by leave-one-out cross-validation. Many variations of the kNN method have been proposed in the past, and new and fast algorithms have continued to appear in recent years. The automated variable selection kNN QSAR technique optimizes the selection of descriptors to obtain the best models [20]. [Pg.315]
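A minimal sketch of this kNN-QSAR idea, with K chosen by leave-one-out cross-validation; all names are illustrative, and the Euclidean distance metric is an assumption:

```python
import numpy as np

def knn_predict(X_train, y_train, x_u, K):
    """Predicted activity of molecule u: mean activity of its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x_u, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:K]                  # indices of the K nearest
    return y_train[nearest].mean()

def choose_K_by_loo(X, y, K_values):
    """Pick the K that minimizes the leave-one-out squared prediction error."""
    N = len(y)
    def loo_sse(K):
        mask = np.arange(N)
        return sum(
            (knn_predict(X[mask != i], y[mask != i], X[i], K) - y[i]) ** 2
            for i in range(N)
        )
    return min(K_values, key=loo_sse)
```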

It is usual to have the coefficient of determination, r², and the standard deviation or RMSE reported for such QSPR models, where the latter two are essentially identical. The r² value indicates how well the model fits the data: given an r² value close to 1, most of the variation in the original data is accounted for. However, even an r² of 1 provides no indication of the predictive properties of the model. Therefore, leave-one-out tests of the predictivity are often reported with a QSAR, where sequentially all but one compound are used to generate a model and the remaining one is predicted. The analogous statistical measures resulting from such leave-one-out cross-validation are often denoted as q² and s_PRESS. Nevertheless, care must be taken even with respect to such predictivity measures, because they can be considerably misleading if clusters of similar compounds are in the dataset. [Pg.302]
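Under the usual conventions, r² is computed from the fitted values and q² from the cross-validated (left-out) predictions. A sketch, assuming these standard formulas:

```python
import numpy as np

def r_squared(y, y_fit):
    """r2: fraction of the variation in y accounted for by the fitted values."""
    y, y_fit = np.asarray(y, float), np.asarray(y_fit, float)
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def q_squared(y, y_loo):
    """q2: same form, but using leave-one-out (cross-validated) predictions."""
    y, y_loo = np.asarray(y, float), np.asarray(y_loo, float)
    press = np.sum((y - y_loo) ** 2)       # PRESS from the left-out predictions
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot
```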

Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained from leave-one-out cross-validation using PCR (o) and PLS ( ) regression.
Cuthbert, J.L. and McVetty, P.B.E. (2001). 'Plot-to-plot, row-to-row and plant-to-plant out-crossing studies in oilseed rape', Can. J. Plant Sci., 81, 657-664. [Pg.486]

Trost's group examined the possibility of carrying out cross-coupling reactions of alkynes and transformed this into a very powerful synthetic method. Either homocoupling or, perhaps more interestingly, heterocoupling procedures were performed using catalytic amounts of a palladium salt (Equation (191)). [Pg.157]

Costanzo et al. <2002MI87> published the synthesis of the furyl-substituted pyrazolo[3,2-f][1,2,4]triazine derivative 146 by Suzuki coupling of the iodo compound 145 with 2-furylboronic acid. The yield was found to be moderate (38%). It may be important to mention that these authors also tried to transform 145 to a heteroaromatic boronic acid and to carry out cross-coupling of this compound with 3-bromofuran. Unfortunately, however, this approach failed and only homo-coupling occurred. [Pg.976]

Most of the QSAR-modeling methods implement the leave-one-out (LOO) (or leave-some-out) cross-validation procedure. The outcome from this procedure is a... [Pg.438]

For time-series data, the contiguous block method can provide a good assessment of the temporal stability of the model, whereas the Venetian blinds method can better assess nontemporal errors. For batch data, one can either specify custom subsets where each subset is assigned to a single batch (i.e., leave-one-batch-out cross-validation), or use Venetian blinds or contiguous blocks to assess within-batch and between-batch prediction errors, respectively. For blocked data that contains replicates, one must be very careful with the Venetian blinds and contiguous block methods to select parameters such that the replicate sample trap and the external subset traps, respectively, are avoided. [Pg.411]
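The two split schemes can be sketched as follows; the function names are illustrative, and real chemometrics packages parameterize these splits differently:

```python
import numpy as np

def venetian_blinds(N, n_splits):
    """Subset k takes samples k, k + n_splits, k + 2*n_splits, ... (interleaved)."""
    idx = np.arange(N)
    return [idx[k::n_splits] for k in range(n_splits)]

def contiguous_blocks(N, n_splits):
    """Subsets are consecutive runs of samples (preserves time order)."""
    return np.array_split(np.arange(N), n_splits)

print(venetian_blinds(9, 3))    # [array([0, 3, 6]), array([1, 4, 7]), array([2, 5, 8])]
print(contiguous_blocks(9, 3))  # [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]
```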

Calculated descriptors have generally fallen into two broad categories: those that seek to model an experimentally determined or physical descriptor (such as ClogP or CpKa) and those that are purely mathematical [such as the Kier and Hall connectivity indices (4)]. Not surprisingly, the latter category has been heavily populated over the years, so much so that QSAR/QSPR practitioners have had to rely on model validation procedures (such as leave-k-out cross-validation) to avoid models built upon chance correlation. Of course, such procedures are far less critical when very few descriptors are used (such as with the Hansch, Leo, and Abraham descriptors); it can even be argued that they are then unnecessary. [Pg.262]

Figure 4.37. RMSECV_PCA from leave-one-out cross-validation for PCA Example 2. [Pg.58]

The preprocessed data and class membership information are submitted to the analysis software. Euclidean distance and leave-one-out cross-validation are used to determine the value for K and the cutoff for G. [Pg.69]
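A sketch of selecting K by leave-one-out classification accuracy with Euclidean distance; the cutoff G is method-specific and not reproduced here, all names are illustrative, and X and labels are assumed to be NumPy arrays:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels, x, K):
    """Majority vote among the K nearest training samples (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:K]
    return Counter(labels[nearest]).most_common(1)[0][0]

def best_K(X, labels, K_values):
    """Return the K with the highest leave-one-out classification accuracy."""
    N = len(labels)
    def loo_accuracy(K):
        mask = np.arange(N)
        return sum(
            knn_classify(X[mask != i], labels[mask != i], X[i], K) == labels[i]
            for i in range(N)
        ) / N
    return max(K_values, key=loo_accuracy)
```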

Steps 1 and 2 are discussed in detail in Sections 4.2.1 and 4.2.2 (PCA and HCA). In Step 3, the training set is divided into calibration and test sets to facilitate the estimation of the SIMCA models. Typically, we leave more than half of the data in the calibration set. It is also a good practice to repeat the calibration procedure in Table 4.16 with different selections of calibration and test sets. An alternative to separate test sets is to implement some form of cross-validation. For example, leave-one-out cross-validation can be performed where each sample is left out and predicted one at a time. [Pg.75]
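A minimal sketch of such a random calibration/test split, keeping more than half of the data in the calibration set (the two-thirds fraction is an illustrative choice, not from the source):

```python
import numpy as np

def calibration_test_split(N, frac_cal=2/3, seed=0):
    """Randomly assign indices, keeping more than half in the calibration set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N)
    n_cal = int(np.ceil(frac_cal * N))
    return perm[:n_cal], perm[n_cal:]   # (calibration indices, test indices)

cal, test = calibration_test_split(12)
print("calibration:", sorted(cal), "test:", sorted(test))
```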

