Big Chemical Encyclopedia


Multiple cross-validation

Controlling the complexity of a model is called regularization. To this end, hold-out data is important. In order to benefit from a training set that is as large as possible while still measuring performance on unseen data, cross-validation is used: it performs multiple iterations of training and testing on different partitionings of the data. Leave-one-out is certainly the most prominent scheme here [154]; however, other ways to partition the data are in use as well. [Pg.76]
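The leave-one-out idea can be sketched in a few lines: each observation is held out in turn, the model is trained on the rest, and the prediction error on the held-out point is recorded. This is a minimal illustration using a simple mean predictor on made-up data, not any specific model from the text.

```python
# Minimal sketch of leave-one-out cross-validation using a mean
# predictor; the data values are invented for illustration only.
def leave_one_out_errors(y):
    """For each point, train on the rest (here: take their mean)
    and record the squared prediction error on the held-out point."""
    errors = []
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]        # all observations except the i-th
        prediction = sum(train) / len(train)
        errors.append((y[i] - prediction) ** 2)
    return errors

y = [2.0, 3.0, 4.0, 5.0]
press = sum(leave_one_out_errors(y))     # predictive residual sum of squares
print(round(press, 3))                   # -> 8.889
```

Summing these squared errors gives the PRESS statistic, the basis of the cross-validated correlation coefficients discussed below.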

The statistics reported for the fit are the number of compounds used in the model (n), the squared multiple correlation coefficient (R2), the cross-validated multiple correlation coefficient (R2cv), the standard error of the fit (s), and the F statistic. The squared multiple correlation coefficient can take values between 0 (no fit at all) and 1 (a perfect fit); when multiplied by 100 it gives the percentage of variance in the response explained by the model (here 83%). This equation is actually quite a good fit to the data, as can be seen from the plot of predicted against observed values shown in Figure 7.6. [Pg.172]
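A hedged sketch of how R2 and its cross-validated counterpart (often written Q2) could be computed for a univariate least-squares fit; the data here are hypothetical, not the compound set from the text. The cross-validated value refits the line with each point left out and predicts the omitted point from the reduced model.

```python
import numpy as np

def r_squared(y, y_pred):
    """Fraction of response variance explained by the fit."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_q_squared(x, y):
    """Cross-validated R2 (Q2): leave each point out, refit the
    line, and predict the omitted point from the reduced model."""
    preds = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        preds.append(slope * x[i] + intercept)
    press = np.sum((y - np.array(preds)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot

# Hypothetical, nearly linear data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
slope, intercept = np.polyfit(x, y, 1)
r2 = r_squared(y, slope * x + intercept)
q2 = loo_q_squared(x, y)
print(round(r2, 3), round(q2, 3))
```

Because each leave-one-out prediction comes from a model that never saw the point, Q2 is typically somewhat lower than R2; a large gap between the two is a warning sign of overfitting.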

Cramer, R.D., Bunce, J.D., Patterson, D.E. and Frank, I.E., Cross-validation, bootstrapping, and partial least squares compared with multiple regression in conventional QSAR studies, Quant. Struct.-Act. Relat., 7, 18-25, 1988. [Pg.179]

Figures 11 and 12 illustrate the performance of the pR2 compared with several currently popular criteria on a specific data set resulting from one of the drug-hunting projects at Eli Lilly. This data set has IC50 values for 1289 molecules. There were 2317 descriptors (or covariates), and a multiple linear regression model was used with forward variable selection; the linear model was trained on half the data (selected at random) and evaluated on the other (hold-out) half. The root mean squared error of prediction (RMSE) for the hold-out test set is minimized when the model has 21 parameters. Figure 11 shows the model size chosen by several criteria applied to the training set in a forward selection: for example, the pR2 chose 22 descriptors, the Bayesian Information Criterion (BIC) chose 49, leave-one-out cross-validation chose 308, the adjusted R2 chose 435, and the Akaike Information Criterion (AIC) chose 512 descriptors. Although the pR2 criterion selected considerably fewer descriptors than the other methods, it had the best prediction performance. Also, only pR2 and BIC had better prediction on the test data set than the null model.
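The forward-selection-plus-criterion procedure described above can be sketched on synthetic data. This is not the Eli Lilly data set or the pR2 criterion; it is a generic illustration of greedy forward selection stopped by AIC versus BIC, showing why the stronger BIC penalty tends to stop earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the situation described: many candidate
# descriptors, only a few of which truly relate to the response.
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

def rss(X_sub, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    resid = y - X_sub @ beta
    return resid @ resid

def forward_select(X, y, criterion):
    """Greedy forward selection: at each step add the column that most
    improves the criterion; stop when no addition improves it."""
    n = len(y)
    selected = []
    best_score = criterion(n, rss(np.ones((n, 1)), y), 1)
    while True:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            return selected
        scored = []
        for j in candidates:
            cols = np.column_stack([np.ones(n)] + [X[:, k] for k in selected + [j]])
            k = len(selected) + 2            # intercept + chosen descriptors
            scored.append((criterion(n, rss(cols, y), k), j))
        score, j = min(scored)
        if score >= best_score:
            return selected
        best_score = score
        selected.append(j)

aic = lambda n, rss_val, k: n * np.log(rss_val / n) + 2 * k
bic = lambda n, rss_val, k: n * np.log(rss_val / n) + np.log(n) * k

sel_aic = forward_select(X, y, aic)
sel_bic = forward_select(X, y, bic)
print(len(sel_aic), len(sel_bic))
```

Both criteria rank candidate additions identically at each step (only the stopping rule differs), so BIC, with its log(n) penalty per parameter, never selects more variables than AIC along the same greedy path — mirroring the ordering of model sizes reported in the passage.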
There is an approach in QSRR in which principal components extracted from analysis of large tables of structural descriptors of analytes are regressed against the retention data in a multiple regression, i.e., principal component regression (PCR). The partial least squares (PLS) approach with cross-validation [129] also finds application in QSRR. Recommendations for reporting the results of PCA have been published [130]. [Pg.519]
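Principal component regression can be sketched in a few steps: center the descriptor table, extract leading principal components, and regress the response on the component scores. The data below are synthetic (two latent factors generating ten correlated descriptors), standing in for the large descriptor tables mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic descriptor table: ten correlated descriptors driven by
# two underlying latent factors, plus a response tied to those factors.
n = 100
latent = rng.normal(size=(n, 2))
X = np.column_stack([latent @ rng.normal(size=(2,)) + 0.05 * rng.normal(size=n)
                     for _ in range(10)])
y = latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=n)

# Principal components of the centered descriptor table via SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
n_comp = 2
scores = Xc @ Vt[:n_comp].T                  # scores on the first two PCs

# Ordinary least squares of y on the component scores (plus intercept).
A = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))
```

Because the components are orthogonal and few, the regression is stable even though the original descriptors are highly collinear — the practical motivation for PCR in QSRR.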

Variable selection is performed by checking the squared multiple correlation coefficient, or the corresponding cross-validated quantity, from univariate regression models y = b0 + bj*xj, the selection being made separately for each jth variable of the p variables x1, x2, ..., xp. [Pg.467]
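This univariate screening step can be sketched directly: fit y = b0 + bj*xj for each candidate variable on its own and rank the variables by the resulting squared correlation. The data are synthetic, with only one informative variable.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: six candidate variables, only x3 (index 2) informative.
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 2] + rng.normal(size=n)

def univariate_r2(x, y):
    """Squared correlation from the univariate fit y = b0 + bj*xj."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

r2s = [univariate_r2(X[:, j], y) for j in range(p)]
best = int(np.argmax(r2s))
print(best)   # index of the variable with the largest univariate r-squared
```

In practice the cross-validated analogue of each univariate R2 would be used instead, exactly as in the leave-one-out computation shown earlier, to guard against selecting variables that fit only by chance.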

Table 2 shows the results. In each case, multiple regression was used to combine variables from the CAT tasks to predict IQ. The multiple correlations obtained ranged from 0.63 to 0.93, with a mean of 0.79 and a standard deviation of 0.11. There are many ways to compute multiple regression with the wealth of data provided by CAT, and Table 2 shows that several different methods were used, including cross-validation. Even when we use variables that have correlations of 0.25 or less with intelligence, we obtain a multiple correlation with IQ of 0.63. Cross-validation yields a multiple correlation almost as high as that obtained in the original sample. [Pg.139]
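The comparison in the last sentence — the multiple correlation in the fitting sample versus its cross-validated counterpart — can be illustrated on synthetic data. This is a generic 5-fold sketch, not the CAT data; with a genuine signal and an adequate sample size, the cross-validated correlation shrinks only slightly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data with a real multivariate signal.
n, p = 250, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.8, 0.6, 0.4, 0.2]) + rng.normal(size=n)

def fit_predict(X_tr, y_tr, X_te):
    """Ordinary least squares with intercept; predict for X_te."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ beta

# Multiple correlation R in the full sample (fit and evaluate on all data).
r_fit = np.corrcoef(fit_predict(X, y, X), y)[0, 1]

# Cross-validated counterpart: predict each fold from the other folds.
folds = np.arange(n) % 5
preds = np.empty(n)
for k in range(5):
    te = folds == k
    preds[te] = fit_predict(X[~te], y[~te], X[te])
r_cv = np.corrcoef(preds, y)[0, 1]
print(round(r_fit, 2), round(r_cv, 2))
```

The in-sample correlation is always at least as optimistic as the cross-validated one; when the two nearly agree, as reported in the passage, the model is not merely capitalizing on chance.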







© 2024 chempedia.info