Variable Selection and Modeling

VSMP Variable selection and modeling method based on the prediction... [Pg.510]

Narayanan and Gunturi [33] developed QSPR models based on in vivo blood-brain permeation data of 88 diverse compounds, 324 descriptors, and a systematic variable selection method called Variable Selection and Modeling method based on the Prediction (VSMP). VSMP efficiently explored all... [Pg.541]

Genetic programming, described earlier, picks only certain variables from the model. The rules, which may be in the form of a computer language such as Lisp, or easily interpretable equations, produce a formula from which a result can be calculated (e.g. if (measurement 1 > 2.37 and measurement 2 < 0.53) or measurement 3 > 4.28 then sample is adulterated else sample is clean) [162-165]. Rather than being a pre-processing step before statistical analysis, this method combines the variable selection and model formation stages into one. [Pg.106]
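
The example rule quoted above can be written out directly as code. This is only an illustrative sketch: the function name and the argument names m1, m2, m3 are assumptions, and only the thresholds come from the quoted rule.

```python
def classify_sample(m1: float, m2: float, m3: float) -> str:
    """Apply the quoted example rule to three measurements (names are hypothetical)."""
    # (measurement 1 > 2.37 and measurement 2 < 0.53) or measurement 3 > 4.28
    if (m1 > 2.37 and m2 < 0.53) or m3 > 4.28:
        return "adulterated"
    return "clean"


if __name__ == "__main__":
    # Made-up measurement values, chosen only to exercise both branches.
    print(classify_sample(2.50, 0.40, 1.00))
    print(classify_sample(1.00, 0.90, 1.00))
```

Only the measurements that appear in the evolved rule are needed to make a prediction, which is the sense in which selection and modeling happen in one step.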

Nicholls, A., MacCuish, N.E., and MacCuish, J.D. (2004) Variable selection and model validation of 2D and 3D molecular descriptors. Journal of Computer-Aided Molecular Design, 18, 451-474. [Pg.379]

VSMP A Novel Variable Selection and Modeling Method Based on the Prediction. [Pg.347]

Getting it Right: Variable Selection and Model Validation in (Q)SAR. Spring National ACS... [Pg.348]

A number of performance criteria are not primarily dedicated to the users of a model but are applied in model generation and optimization. For instance, the mean squared error (MSE) or similar measures are considered for optimization of the number of components in PLS or PCA. For variable selection, the models to be compared have different numbers of variables; in this case, and especially if a fit criterion is used, the performance measure must consider the number of variables. Appropriate measures are the adjusted squared correlation coefficient (adjusted R^2) or Akaike's information criterion (AIC); see Section 4.2.3. [Pg.124]
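
As a rough illustration of how such criteria penalize model size, the sketch below uses one common textbook form of each measure. The exact definitions in the cited Section 4.2.3 are not reproduced here and may differ; the function names and the numbers in the usage lines are assumptions.

```python
import math


def adjusted_r2(r2: float, n: int, m: int) -> float:
    """Adjusted R^2 for n samples and m regressor variables (model with intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)


def aic_least_squares(rss: float, n: int, m: int) -> float:
    """A common least-squares form of AIC (up to an additive constant):
    n * ln(RSS / n) + 2 * m, where RSS is the residual sum of squares."""
    return n * math.log(rss / n) + 2 * m


if __name__ == "__main__":
    # Hypothetical comparison: a small model versus a larger one with a
    # slightly better raw fit on the same 20 samples.
    print(adjusted_r2(0.90, n=20, m=3), adjusted_r2(0.91, n=20, m=10))
    print(aic_least_squares(10.0, n=20, m=3), aic_least_squares(9.0, n=20, m=10))
```

In this made-up comparison the larger model has the higher raw R^2 but the lower adjusted R^2 and the higher (worse) AIC, which is the behaviour the text asks of a criterion used for variable selection.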

The most reliable approach would be an exhaustive search among all possible variable subsets. Since each variable could enter the model or be omitted, this would be 2^m - 1 possible models for a total number of m available regressor variables. For 10 variables, there are about 1000 possible models, for 20 about one million, and for 30 variables one ends up with more than one billion possibilities, and we are still not in the range for m that is standard in chemometrics. Since the goal is best possible prediction performance, one would also have to evaluate each model in an appropriate way (see Section 4.2). This makes clear that an expensive evaluation scheme like repeated double CV is not feasible within variable selection, and thus mostly only fit criteria (AIC, BIC, adjusted R^2, etc.) or fast evaluation schemes (leave-one-out CV) are used for this purpose. It is essential to use performance criteria that consider the number of used variables; for instance, plain R^2 is not appropriate because this measure usually increases with increasing number of variables. [Pg.152]
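
The subset counts quoted above follow from 2^m - 1 (every non-empty subset of m variables). A minimal sketch, with hypothetical function names, that counts the subsets and enumerates them for a small m:

```python
from itertools import combinations


def n_subsets(m: int) -> int:
    """Number of non-empty variable subsets for m candidate variables."""
    return 2 ** m - 1


def all_subsets(variables):
    """Yield every non-empty subset, smallest subsets first."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


if __name__ == "__main__":
    for m in (10, 20, 30):
        print(m, n_subsets(m))
    # Enumeration is only practical for small m.
    print(list(all_subsets(["x1", "x2", "x3"])))
```

For m = 10, 20, and 30 the counts are 1023, 1048575, and 1073741823, matching the "about 1000", "about one million", and "more than one billion" in the text.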

Since an exhaustive search, possibly combined with exhaustive evaluation, is practically impossible, any variable selection procedure will mostly yield suboptimal variable subsets, with the hope that they approximate the global optimum in the best possible way. A strategy could be to apply different algorithms for variable selection and save the best candidate solutions (typically 5-20 variable subsets). With this low number of potentially interesting models, it is possible to perform a detailed evaluation (like repeated double CV) in order to find one or several variables... [Pg.152]
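
The two-stage strategy described above (cheap screening, then detailed evaluation of a short list) might be organized as in the sketch below. All names are hypothetical; fast_score and slow_score stand in for whatever fit criterion and detailed evaluation scheme are actually used, and lower scores are assumed to be better.

```python
import heapq


def shortlist(candidates, fast_score, k=10):
    """Keep the k candidate subsets with the lowest fast-criterion score."""
    return heapq.nsmallest(k, candidates, key=fast_score)


def final_choice(candidates, fast_score, slow_score, k=10):
    """Run the expensive evaluation only on the shortlisted subsets."""
    return min(shortlist(candidates, fast_score, k), key=slow_score)


if __name__ == "__main__":
    # Toy stand-ins: candidates are tuples of variable names, and the two
    # scoring functions are placeholders, not real model evaluations.
    candidates = [("a",), ("a", "b"), ("a", "b", "c"), ("b",)]
    best = final_choice(candidates, fast_score=len,
                        slow_score=lambda s: abs(len(s) - 2), k=3)
    print(best)
```

The expensive scorer is called at most k times, regardless of how many candidates the selection algorithms produced.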

The analysis of the crystal structures of Cox2 with a selective ligand SC-558 (PDB [66] code 1CX2) and the nonselective Cox1/Cox2 inhibitor flurbiprofen (PDB code 3PGH) fully confirms the expectedly weak match between the pharmacophore models built by variable selection and the actual Cox2 interaction map. [Pg.134]

An important aspect of variable selection that is often overlooked is the hazard brought about through the use of cross-validation for two quite different purposes, namely (1) as an optimization criterion for variable selection and other model optimization tasks (including selection of the optimal number of PLS LVs or PCR PCs) and (2) as an assessment of the quality of the final model built using all samples. In this case, one can get highly optimistic estimates of a model's performance, because the same criterion is used to both optimize and evaluate the model. As a result, when doing variable selection, especially with a limited number of calibration samples, it is advisable to do an additional outer loop cross-validation across the entire model... [Pg.424]
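
A minimal sketch of the outer-loop idea, under strong simplifying assumptions: the "model" is a one-variable straight-line fit, variable selection means choosing which single column to use, the inner criterion is leave-one-out CV, and the outer folds are interleaved. None of these choices comes from the source; they only keep the example short and dependency-free.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def loo_mse(x, y):
    """Leave-one-out CV mean squared error for a single predictor."""
    errors = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errors.append((y[i] - (slope * x[i] + intercept)) ** 2)
    return mean(errors)


def select_variable(columns, y):
    """Inner loop: index of the column with the lowest LOO CV error."""
    return min(range(len(columns)), key=lambda j: loo_mse(columns[j], y))


def nested_cv_mse(columns, y, n_folds=5):
    """Outer loop: selection is repeated inside every training split, so the
    held-out samples never influence which variable is chosen."""
    n = len(y)
    errors = []
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        train_cols = [[col[i] for i in train] for col in columns]
        train_y = [y[i] for i in train]
        j = select_variable(train_cols, train_y)
        slope, intercept = fit_line(train_cols[j], train_y)
        errors.extend((y[i] - (slope * columns[j][i] + intercept)) ** 2
                      for i in test)
    return mean(errors)
```

Comparing `min(loo_mse(col, y) for col in columns)` computed on all samples with `nested_cv_mse(columns, y)` gives a feel for how much of the apparent performance comes from the selection step itself.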

Yasri, A. and Hartsough, D. (2001) Toward an optimal procedure for variable selection and QSAR model building. J. Chem. Inf. Comput. Sci. 41, 1218-1227. [Pg.211]

In many modeling techniques, the number of parameters is modified many times looking for a setting that provides the maximum predictive ability for the model. Techniques for variable selection and methods based on artificial neural networks perform an optimization, that is, they search for conditions able to provide the maximum predictive ability possible for a given sample subset. [Pg.96]

Data preprocessing is important in multivariate calibration. Indeed, the relationship between even basic procedures such as centring the columns is not always clear, with most investigators following conventional methods that have been developed for some popular application but are not always appropriately transferable. Variable selection and standardisation can have a significant influence on the performance of calibration models. [Pg.26]

Lin (1993) suggested using stepwise variable selection and Wu (1993) suggested forward selection or all (estimable) subsets selection. Lin (1993) gave an illustrative analysis by stepwise selection of the data in Table 6. He found that this identified factors 15, 12, 19, 4, and 10 as the active factors, when their main effects are entered into the model in this order. Wang (1995) analyzed the other half of the Williams experiment and identified only one of the five factors that Lin had identified as being nonnegligible, namely, factor 4. [Pg.181]

PCM modeling aims to find an empirical relation (a PCM equation or model) that describes the interaction activities of the biopolymer-molecule pairs as accurately as possible. To this end, various linear and nonlinear correlation methods can be used. Nonlinear methods have hitherto been used to only a limited extent. The method of prime choice has been partial least-squares projection to latent structures (PLS), which has been found to work very satisfactorily in PCM. PCA is also an important data-preprocessing tool in PCM modeling. Modeling includes statistical model-validation techniques such as cross validation, external prediction, and variable-selection and signal-correction methods to obtain statistically valid models. (For general overviews of modeling methods see [10]). [Pg.294]

In general, model complexity is related to the number of variables selected for modelling purposes. Let I be the vector of length p, where p is the total number of variables, constituted of p binary variables. Each variable takes a value equal to zero (I_j = 0) if the jth variable is excluded and a value equal to one (I_j = 1) if the jth variable remains included within the variables. [Pg.295]
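
The binary inclusion vector I described above maps naturally onto a column mask. A small sketch, with hypothetical function names and made-up values:

```python
def select_columns(row, inclusion):
    """Keep the entries of one data row whose inclusion flag I_j equals 1."""
    return [value for value, flag in zip(row, inclusion) if flag == 1]


def model_size(inclusion):
    """Number of included variables, i.e. the sum of the I_j."""
    return sum(inclusion)


if __name__ == "__main__":
    inclusion = [1, 0, 0, 1]        # p = 4; variables 1 and 4 are included
    row = [0.5, 1.2, 3.3, 4.1]      # one sample with p descriptor values
    print(select_columns(row, inclusion), model_size(inclusion))
```

With this encoding, the sum of the vector is a direct measure of the number of variables in the model.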

Some basic concepts and definitions of statistics, chemometrics, algebra, graph theory, similarity/diversity, which are fundamental tools in the development and application of molecular descriptors, are also presented in the Handbook in some detail. More attention has been paid to information content, multivariate correlation, model complexity, variable selection, and parameters for model quality estimation, as these are the characteristic components of modern QSAR/QSPR modelling. [Pg.680]

Marini F, Roncaglioni A, Novic M. Variable selection and interpretation in structure-affinity correlation modeling of estrogen receptor binders. J Chem Inf Model 2005;45:1507-19. [Pg.341]

Salt, D.W., Maccari, L., Botta, M. and Ford, M.G. (2004) Variable selection and specification of robust QSAR models from multicollinear data: arylpiperazinyl derivatives with affinity and selectivity for α2-adrenoceptors. J. Comput. Aid. Mol. Des., 18, 495-509. [Pg.1163]

Iman, R.L., Helton, J.C. and Campbell, J.E. An approach to sensitivity analysis of computer models: Part I - Introduction, input variable selection and preliminary variable assessment. Journal of Quality Technology 1981;13:174-183. [Pg.372]
