Cross-validated multiple regression

Cramer, R.D., Bunce, J.D., Patterson, D.E. and Frank, I.E., Cross-validation, bootstrapping, and partial least squares compared with multiple regression in conventional QSAR studies, Quant. Struct.-Act. Relat., 7, 18-25, 1988. [Pg.179]

Figures 11 and 12 illustrate the performance of the pR2 compared with several currently popular criteria on a specific data set resulting from one of the drug-hunting projects at Eli Lilly. This data set has IC50 values for 1289 molecules. There were 2317 descriptors (or covariates), and a multiple linear regression model was used with forward variable selection; the linear model was trained on half the data (selected at random) and evaluated on the other (hold-out) half. The root mean squared error of prediction (RMSE) for the hold-out test set is minimized when the model has 21 parameters. Figure 11 shows the model size chosen by several criteria applied to the training set in a forward selection: for example, the pR2 chose 22 descriptors, the Bayesian Information Criterion (BIC) chose 49, leave-one-out cross-validation chose 308, the adjusted R2 chose 435, and the Akaike Information Criterion (AIC) chose 512 descriptors. Although the pR2 criterion selected considerably fewer descriptors than the other methods, it had the best prediction performance. Also, only the pR2 and BIC had better prediction on the test data set than the null model.
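The following is a minimal sketch of the workflow described above: forward variable selection run on a random training half, with an information criterion (AIC or BIC) as the stopping rule, and the chosen model judged by RMSE on the hold-out half. It is not the Lilly analysis, the pR2 criterion itself is not implemented, and the data, sizes, and names are synthetic placeholders; scikit-learn and NumPy are assumed to be available.

```python
# Forward selection stopped by an information criterion, evaluated on a hold-out half.
# Synthetic illustration only; dimensions are far smaller than the 1289 x 2317 data set.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_rss(Xs, ys):
    """Ordinary least squares with an intercept; return coefficients and RSS."""
    D = np.column_stack([np.ones(len(ys)), Xs])
    beta, *_ = np.linalg.lstsq(D, ys, rcond=None)
    resid = ys - D @ beta
    return beta, float(resid @ resid)

def forward_select(X, y, criterion="bic"):
    n = len(y)
    selected, remaining, best_score = [], list(range(X.shape[1])), np.inf
    while remaining:
        scores = []
        for j in remaining:
            _, rss = fit_rss(X[:, selected + [j]], y)
            k = len(selected) + 2                      # fitted coefficients including the intercept
            penalty = k * (np.log(n) if criterion == "bic" else 2.0)
            scores.append((n * np.log(rss / n) + penalty, j))
        score, j = min(scores)
        if score >= best_score:                        # stop when the criterion no longer improves
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

for crit in ("aic", "bic"):
    sel = forward_select(X_tr, y_tr, crit)
    beta, _ = fit_rss(X_tr[:, sel], y_tr)
    pred = np.column_stack([np.ones(len(y_te)), X_te[:, sel]]) @ beta
    rmse = np.sqrt(np.mean((y_te - pred) ** 2))
    print(f"{crit.upper()}: {len(sel)} descriptors, hold-out RMSE = {rmse:.3f}")
```

As in the figures discussed above, the BIC's heavier penalty typically stops the forward search much earlier than the AIC on data with many candidate descriptors.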
There is an approach in QSRR in which principal components extracted from analysis of large tables of structural descriptors of analytes are regressed against the retention data in a multiple regression, i.e., principal component regression (PCR). The partial least squares (PLS) approach with cross-validation (29) also finds application in QSRR. Recommendations for reporting the results of PCA have been published (130). [Pg.519]
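Below is a short sketch of the two strategies just named, PCR and PLS with cross-validation, using scikit-learn. The descriptor matrix and retention values are synthetic stand-ins, and the choice of five components and five folds is arbitrary for illustration.

```python
# PCR (principal components regressed against retention data) versus PLS,
# both scored by cross-validation.  Synthetic data only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 120))          # many (correlated) structural descriptors
y = X[:, :4] @ np.array([1.0, -0.5, 0.3, 0.8]) + rng.normal(scale=0.2, size=60)

# PCR: regress the retention data on the first few principal components of X
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
print("PCR 5-fold CV r2:", cross_val_score(pcr, X, y, cv=5, scoring="r2").mean())

# PLS: latent variables extracted to covary with y, scored the same way
pls = PLSRegression(n_components=5)
print("PLS 5-fold CV r2:", cross_val_score(pls, X, y, cv=5, scoring="r2").mean())
```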

Variable selection is performed by checking the squared multiple correlation coefficient, or the corresponding cross-validated quantity, from the univariate regression models y = b0 + bj xj, the selection being made separately for each jth variable of the p variables x1, x2, ..., xp. [Pg.467]
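A minimal sketch of this univariate screening follows: for each candidate descriptor, a one-variable model y = b0 + bj xj is fitted, and both its R2 and the leave-one-out cross-validated counterpart (q2) are recorded. The data, the cut-off value, and the names are illustrative assumptions, not part of the cited procedure.

```python
# Univariate variable screening by R2 and leave-one-out q2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))                          # p = 10 candidate descriptors
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=40)

retained = []
for j in range(X.shape[1]):
    xj = X[:, [j]]                                     # single-variable model y = b0 + bj*xj
    r2 = LinearRegression().fit(xj, y).score(xj, y)
    pred = cross_val_predict(LinearRegression(), xj, y, cv=LeaveOneOut())
    q2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"x{j+1}: R2 = {r2:.2f}, LOO q2 = {q2:.2f}")
    if q2 > 0.2:                                       # arbitrary illustrative cut-off
        retained.append(j + 1)
print("retained variables:", retained)
```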

Table 2 shows the results. In each case, multiple regression was used to combine variables from the CAT tasks to predict IQ. The multiple correlations obtained ranged from 0.63 to 0.93 with a mean of 0.79 and a standard deviation of 0.11. There are many ways to compute multiple regression with the wealth of data provided by CAT. Table 2 shows that several different methods were used including cross validation. Even when we use variables that have correlations of 0.25 or less with intelligence, we obtain a multiple correlation with IQ of 0.63. Cross validation yields a multiple correlation almost as high as that obtained in the original sample. [Pg.139]
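The contrast drawn in Table 2, between the multiple correlation obtained in the original sample and the one obtained under cross-validation, can be sketched as below. The split-half scheme, sample sizes, and variables are assumptions for illustration, not the CAT study design.

```python
# Multiple correlation R in the derivation sample versus the cross-validated R
# (correlation between observed values and predictions from a model fit on the other half).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))                          # predictor variables
y = X @ rng.normal(size=6) + rng.normal(size=200)      # criterion score

X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_a, y_a)
R_fit = np.corrcoef(y_a, model.predict(X_a))[0, 1]     # multiple R in the derivation sample
R_cv  = np.corrcoef(y_b, model.predict(X_b))[0, 1]     # cross-validated multiple R
print(f"multiple R (same sample) = {R_fit:.2f}, cross-validated R = {R_cv:.2f}")
```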

The molecular descriptors for a CoMFA analysis number in the hundreds or thousands, even for datasets of twenty or so compounds. A multiple regression equation cannot be fitted for such a dataset. In such cases, Partial Least Squares (PLS) is the appropriate method. PLS unravels the relationship between log (1/C) and molecular properties by extracting from the data matrix linear combinations (latent variables) of molecular properties that best explain log (1/C). Because the individual properties are correlated (for example, steric properties at adjacent lattice points), more than one contributes to each latent variable. The first latent variable extracted explains most of the variance in log (1/C), the second the next greatest degree of variance, etc. At each step R2 and s are calculated to help one decide when enough latent variables have been extracted; the maximum number of extracted variables is reached when extracting another does not decrease s substantially. Cross-validation, discussed in Section 3.5.3, is commonly used to decide how many latent variables are significant. For example, Table 3.5 summarizes the CoMFA PLS analysis... [Pg.80]
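A hedged sketch of the cross-validation step described above: PLS models with an increasing number of latent variables are fitted, and the number that maximizes the cross-validated q2 is retained. The synthetic matrix below stands in for a CoMFA field table (many correlated columns, few compounds); the fold scheme and component range are arbitrary choices for the example.

```python
# Choosing the number of PLS latent variables by cross-validated q2.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 300))                        # hundreds of field descriptors, few compounds
y = X[:, :3] @ np.array([0.6, -0.4, 0.5]) + rng.normal(scale=0.1, size=25)

best_a, best_q2 = 1, -np.inf
for a in range(1, 9):                                 # candidate numbers of latent variables
    pred = cross_val_predict(PLSRegression(n_components=a), X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0)).ravel()
    q2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"{a} latent variables: q2 = {q2:.3f}")
    if q2 > best_q2:
        best_a, best_q2 = a, q2
print("chosen number of latent variables:", best_a)
```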

R. D. Cramer III, J. D. Bunce, and D. E. Patterson, Quant. Struct.-Act. Relat., 7, 18 (1988). Cross-Validation, Bootstrapping, and PLS Compared with Multiple Linear Regression in Conventional QSAR Studies. [Pg.116]

For statistical reasons, multiple regression analysis cannot be used for 3D-QSAR methods that consider many more 3D descriptors than compounds or for which the descriptors are mutually correlated. The alternative strategies described next can be used to find a quantitative model in such situations. As will be seen, cross-validation is an important technique for assessing the robustness of a proposed model. [Pg.189]

Just as with multiple regression or PLS, neural nets can overfit the data. For neural networks, however, there is no simple procedure to assess this risk. For example, Manallack et al. performed simulations using both artificial and real QSAR data sets to compare multiple regression and neural networks. They found that the ability of the neural networks to predict the potency of compounds not included in the derivation of the model was often poorer than that of the regression methods. Cross-validation can help overcome this weakness of the former methods. [Pg.194]
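The gap that cross-validation exposes can be illustrated as follows. This is not the Manallack et al. protocol; it simply fits multiple linear regression and an over-parameterized neural network to the same synthetic "QSAR" data and compares the R2 of the fit with the cross-validated R2, using scikit-learn estimators as an assumed toolkit.

```python
# Overfitting made visible: R2 of the fit versus 5-fold cross-validated R2
# for multiple linear regression and a neural network on the same data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 20))
y = X[:, :4] @ rng.normal(size=4) + rng.normal(scale=0.5, size=60)

models = [("MLR", LinearRegression()),
          ("NN ", MLPRegressor(hidden_layer_sizes=(50,), max_iter=5000, random_state=0))]
for name, model in models:
    fit_r2 = model.fit(X, y).score(X, y)                      # apparent fit on all data
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: R2(fit) = {fit_r2:.2f}, R2(5-fold CV) = {cv_r2:.2f}")
```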

CV = cross-validation; MLR = multiple linear regression; mp = melting point; NIPALS = nonlinear iterative partial least squares; NN = neural networks; PCA = principal component analysis... [Pg.2006]

The simplest method for generating a statistical QSAR model is to use multiple linear regression (MLR), employing a number of calculated descriptors to explain the measured activity data. Metrics such as the F value and r2 can be used to describe the statistical validity of the result, and the cross-validated r2 (q2) is used to measure the predictivity of the model. For small numbers of descriptors compared to the number of molecules, this method can provide adequate results. However, the number of properties that can be generated far exceeds the number of molecules in most analyses. In these cases, the MLR approach will rarely provide a QSAR model that is statistically predictive outside of the molecules used to build the model. A number of other techniques are provided in Tsar to handle these large datasets. [Pg.3341]
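A short sketch of the three statistics named above for an ordinary MLR model: r2 and the overall F value of the fit, plus the leave-one-out cross-validated r2 (q2). The descriptor matrix is a synthetic placeholder; the F value is computed from the usual formula F = (r2/p) / ((1 - r2)/(n - p - 1)).

```python
# r2, overall F statistic, and leave-one-out q2 for a multiple linear regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(6)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
F = (r2 / p) / ((1 - r2) / (n - p - 1))               # overall F statistic of the regression

pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
q2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"r2 = {r2:.3f}, F = {F:.1f}, q2 (LOO) = {q2:.3f}")
```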

The final 15 predictors of the construct-sample multiple regression equation resulted in a multiple R = 0.59, which subsequently shrank to R = 0.47 upon cross-validation. None of the simulator event variables had even marginal significance, and they were therefore excluded from the final regression equation. The concurrent prediction equation correctly classified 68.9% of the accident-free drivers and 71.2% of the accident repeaters. This was approximately 20% better than chance prediction. Because of the contrasted criterion group design, however, these validities were overestimates of what would be attained on a normal population of drivers. Results on biographical data and psychomotor functions are listed in Table 5.15. [Pg.145]

