Regression model-based variable importance

At one time, scientists selected variables for inclusion in a regression model based on some type of strong justification for including them. That justification would have been based on some a priori idea or hypothesis that the variables chosen were the critical factors in establishing the variability in the response variable. Although this approach may still be used by scientists, more often today researchers have no prior knowledge about the relative importance of the various independent variables. Consequently, many regression... [Pg.322]

The reduction of latent variables is an effective method to reduce the number of possible models, yet in PLS, variable reduction is not needed. Reducing the number of variables in traditional regression techniques leads to models with improved predictive ability and, in the case of PLS, to a model that is easier to understand. Attempts to reduce the number of variables for PLS have only resulted in simpler models that fit the training set better yet lack the predictive ability of the complete PLS model (111). The reduction of latent variables with respect to the descriptors is possible with no apparent decrease in the model's ability to predict bioactivities, yet the remaining descriptor-based variables must be judged more important before the reduction, which introduces bias (111). [Pg.175]

One way to identify important predictor variables in a multiple regression setting is to run all possible regressions and choose the model based on some criterion, usually the coefficient of determination, the adjusted coefficient of determination, or Mallows' Cp. With this approach, a few candidate models are identified and then explored further through residual analysis, collinearity diagnostics, leverage analysis, etc. While useful, this method is rarely seen in the literature and cannot be advocated because it is a dumbing down of the modeling process: it relies too much on blind use of the computer to solve a problem that should be left to the modeler. [Pg.64]
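The all-possible-regressions procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data set `X`, `y` is invented, the helper names (`solve`, `sse`, `all_subsets`) are hypothetical, and Mallows' Cp is computed in its usual form, SSE_p / MSE_full - n + 2p, where p counts the intercept.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sse(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    D = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(cols) + 1
    XtX = [[sum(d[i] * d[j] for d in D) for j in range(k)] for i in range(k)]
    Xty = [sum(d[i] * yi for d, yi in zip(D, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2 for d, yi in zip(D, y))

def all_subsets(X, y):
    """Return (columns, R2, adjusted R2, Mallows' Cp) for every non-empty subset."""
    n, m = len(X), len(X[0])
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    mse_full = sse(X, y, range(m)) / (n - m - 1)
    rows = []
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            e = sse(X, y, cols)
            p = size + 1  # number of parameters, including the intercept
            r2 = 1 - e / sst
            adj = 1 - (1 - r2) * (n - 1) / (n - p)
            cp = e / mse_full - n + 2 * p
            rows.append((cols, r2, adj, cp))
    return rows

# Invented data: three candidate predictors, eight observations.
X = [[1.0, 2.0, 5.0], [2.0, 1.0, 3.0], [3.0, 4.0, 6.0], [4.0, 3.0, 2.0],
     [5.0, 6.0, 7.0], [6.0, 5.0, 1.0], [7.0, 8.0, 4.0], [8.0, 7.0, 8.0]]
y = [1.1, 3.8, 3.2, 5.9, 4.8, 8.3, 7.1, 9.7]

for cols, r2, adj, cp in sorted(all_subsets(X, y), key=lambda r: r[3]):
    print(cols, round(r2, 3), round(adj, 3), round(cp, 2))
```

The listing only ranks candidates. As the passage stresses, the shortlisted models still need residual, collinearity, and leverage checks, and the choice among them remains the modeler's.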

Some basic statistical requirements were also overlooked when quantitative models based on coefficients were developed. The most important assumption underlying the linear regression model is that the dependent variable contains all the errors in each data pair (Maes 1984; Irvin and Quickenden 1983). The range of 1-octanol/water partition coefficients obtained for each molecule clearly demonstrates that this basic assumption is not met. It has been demonstrated (York 1966) that if this assumption is violated, the fitted slopes can deviate by as much as 40% from the correct value. Thus, the validity and applicability of published quantitative models describing the relationship between the Kow coefficients and soil sorption coefficients are highly questionable. [Pg.320]
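The effect of violating this assumption can be illustrated with a small simulation. The sketch below is not a reproduction of the York (1966) result: the true slope, the noise levels, the sample size, and the helper name `slope` are all arbitrary choices made for illustration. It only shows that ordinary least squares underestimates the slope when the independent variable also carries measurement error.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)      # fixed seed so the run is repeatable
true_slope = 2.0            # assumed "correct" slope

# Error-free x values and a response that carries all of the error.
x_true = [rng.uniform(0.0, 10.0) for _ in range(5000)]
y = [true_slope * x + rng.gauss(0.0, 1.0) for x in x_true]

# The same x values, now observed with their own measurement error.
x_noisy = [x + rng.gauss(0.0, 3.0) for x in x_true]

slope_clean = slope(x_true, y)    # close to the true slope
slope_noisy = slope(x_noisy, y)   # biased toward zero

print("slope with error-free x:", round(slope_clean, 3))
print("slope with noisy x:     ", round(slope_noisy, 3))
```

In this setup the bias grows with the ratio of the error variance in x to the spread of the true x values, which is why a wide range of reported partition coefficients for the same molecule is a warning sign.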

Apart from pharmacophore-based approaches, a variety of methods were applied to decipher important ligand features of PXR activation. VolSurf descriptor-based partial least squares (PLS) regression-based models pointed toward amide responsive regions that implicated good acceptor abilities as key variables [33]. [Pg.324]

Martin et al. [102] reported a study in which LIBS was applied for the first time to wood-based materials where preservatives containing metals had to be determined. They applied PLS-1 block and PLS-2 block (because of the interdependence of the analytes) to multiplicative scatter-corrected data (a data pretreatment option of most use when diffuse radiation is employed to obtain spectra). The authors studied the loadings of a PCA decomposition to identify the main chemical features that grouped samples. Unfortunately, they did not extend the study to the PLS factors. However, they analysed the regression coefficients to determine the most important variables for some predictive models. [Pg.235]

Whether or not any of the possible models are statistically significant is based on several important statistical parameters. Among them are the correlation coefficient (R), the standard error of estimate (s), the value of the F-test of the overall significance (F), the values of the t-test of significance of individual regression coefficients (t), and the cross-correlation coefficients between the independent variables employed in the same regression equation [27]. [Pg.518]
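As a minimal sketch of how these parameters are obtained, the code below computes R, s, F, and t for the simplest case of a single independent variable, using the standard closed-form expressions. The data and the function names (`pearson`, `regression_summary`) are hypothetical. With several predictors the same quantities come from the matrix form of the fit, and `pearson` could be applied to each pair of predictors to obtain their cross-correlations.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def regression_summary(x, y):
    """R, s, F and t for the one-predictor model y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))            # standard error of estimate
    f = (syy - sse) / (sse / (n - 2))       # overall F with 1 and n - 2 d.f.
    t = b1 / (s / math.sqrt(sxx))           # t statistic of the slope
    return {"R": pearson(x, y), "s": s, "F": f, "t": t}

# Invented example data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

summary = regression_summary(x, y)
for name, value in summary.items():
    print(name, round(value, 4))
```

For a single predictor the overall F equals the square of the slope's t statistic, which gives a quick consistency check on the output.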

Statistical methods. Certainly one of the most important considerations in QSAR is the statistical analysis of the correlation of the observed biological activity with structural parameters, either the extrathermodynamic (Hansch) or the indicator variables (Free-Wilson). The coefficients of the structural parameters that establish the correlation with the biological activity can be obtained by a regression analysis. Since the models are constructed in terms of multiple additive contributions, the method of solution is also called multiple linear regression analysis. This method is based on three requirements (223): i) the independent variables (structural parameters) are fixed variates and the dependent variable (biological activity) is randomly produced; ii) the dependent variable is normally and independently distributed for any set of independent variables; and iii) the variance of the dependent variable must be the same for any set of independent variables. [Pg.71]

The PLS model becomes identical to the MLR model when the number of latent variables of a PLS-derived model equals the number of actual independent variables, something that rarely happens as a consequence of model validation. The regression coefficients of the MLR model are straightforward to interpret, while the PLS latent variables need to be retransformed into the original variable space to be interpreted in a similar manner. This also means that the PLS regression coefficients are dimension dependent, that is, they depend on how many latent variables (PLS components) are used. However, since each PLS component explains a decreasing amount of variance, it is usually not that important whether a PLS model is based on three or four components, which also means that the PLS regression coefficients will not differ very much between the three- and four-component models. [Pg.1012]
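The equivalence between PLS and MLR at the full number of components can be checked numerically. The sketch below assumes a single response (PLS1) fitted with a NIPALS-style deflation on mean-centered data, two invented predictors, and hypothetical function names (`pls1_fitted`, `mlr_fitted`). It compares fitted values rather than back-transformed coefficients and is not taken from the cited source.

```python
def center(v):
    """Subtract the mean from a sequence."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def pls1_fitted(cols, y, n_comp):
    """Fitted (centered) response from PLS1; cols is a list of predictor columns."""
    E = [center(c) for c in cols]   # predictor residuals, one list per column
    f = center(y)                   # response residual
    n = len(y)
    fitted = [0.0] * n
    for _ in range(n_comp):
        w = [dot(c, f) for c in E]  # weights proportional to covariance with y
        norm = dot(w, w) ** 0.5
        w = [a / norm for a in w]
        t = [sum(w[j] * E[j][i] for j in range(len(E))) for i in range(n)]  # scores
        tt = dot(t, t)
        q = dot(f, t) / tt                   # response loading
        p = [dot(c, t) / tt for c in E]      # predictor loadings
        E = [[E[j][i] - t[i] * p[j] for i in range(n)] for j in range(len(E))]
        f = [f[i] - q * t[i] for i in range(n)]
        fitted = [fitted[i] + q * t[i] for i in range(n)]
    return fitted

def mlr_fitted(cols, y):
    """Centered least-squares fit for exactly two predictors (2x2 normal equations)."""
    x1, x2, yc = center(cols[0]), center(cols[1]), center(y)
    a, b, d = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    g, h = dot(x1, yc), dot(x2, yc)
    det = a * d - b * b
    b1, b2 = (d * g - b * h) / det, (a * h - b * g) / det
    return [b1 * u + b2 * v for u, v in zip(x1, x2)]

# Invented data: two predictors, six observations.
cols = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 13.9]

mlr = mlr_fitted(cols, y)
diff_one = max(abs(a - b) for a, b in zip(pls1_fitted(cols, y, 1), mlr))
diff_two = max(abs(a - b) for a, b in zip(pls1_fitted(cols, y, 2), mlr))
print("max difference, 1 component: ", diff_one)
print("max difference, 2 components:", diff_two)
```

With two predictors, the two-component fit should agree with the least-squares fit to rounding error, while the one-component fit differs. This matches the point in the passage that PLS results depend on the number of components retained.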

Nowadays, the most favored regression technique is Partial Least Squares Regression (PLS or PLSR). As happens with PCR, PLS is based on components (or latent variables). The PLS components are computed by taking into account both the x and the y variables, and therefore they are slightly rotated versions of the Principal Components. As a consequence, their ranking order corresponds to their importance in the modeling of the response. A further difference from OLS and PCR is that, while those methods must work on each response variable separately, PLS can be applied to multiple responses at the same time. [Pg.236]
