
Best Variable Subset Selection

Partition-based modeling methods are also called subset selection methods because they select a smaller subset of the most relevant inputs. The resulting model is often physically interpretable because it is developed by explicitly selecting the input variables that are most relevant to approximating the output. This approach works best when the variables are independent (De Veaux et al., 1993). The variables selected by these methods can be used as the analyzed inputs for the interpretation step. [Pg.41]

Three performance measures are commonly used for variable selection by stepwise regression or by best-subset regression. An example in Section 4.5.8 describes the use and comparison of these measures. [Pg.129]

The most reliable approach would be an exhaustive search among all possible variable subsets. Since each variable can either enter the model or be omitted, there are 2^m - 1 possible models for a total of m available regressor variables. For 10 variables, there are about 1000 possible models, for 20 about one million, and for 30 variables one ends up with more than one billion possibilities; and we are still not in the range of m that is standard in chemometrics. Since the goal is the best possible prediction performance, one would also have to evaluate each model in an appropriate way (see Section 4.2). This makes clear that an expensive evaluation scheme like repeated double CV is not feasible within variable selection, and thus mostly only fit criteria (AIC, BIC, adjusted R^2, etc.) or fast evaluation schemes (leave-one-out CV) are used for this purpose. It is essential to use performance criteria that consider the number of used variables; for instance, R^2 alone is not appropriate because this measure usually increases with an increasing number of variables. [Pg.152]
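To make the combinatorics concrete, here is a minimal sketch in Python (assuming numpy and scikit-learn are available; the function names are illustrative, not from the source) that enumerates all 2^m - 1 subsets for a small m and scores each model with adjusted R^2, a fit criterion that penalizes the number of used variables:

```python
# Minimal sketch: exhaustive subset search scored by adjusted R^2.
# Only feasible for small m; here m = 10 gives 2^10 - 1 = 1023 models.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, k):
    """Adjusted R^2 penalizes the number k of used variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def best_subset_exhaustive(X, y):
    n, m = X.shape
    best_score, best_subset = -np.inf, None
    for k in range(1, m + 1):                      # subset sizes 1..m
        for subset in combinations(range(m), k):   # all C(m, k) subsets
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            score = adjusted_r2(model.score(X[:, cols], y), n, k)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] - 3 * X[:, 3] + rng.normal(size=100)  # only x0, x3 matter
print(best_subset_exhaustive(X, y))
```

Even at m = 10 this loop already fits over a thousand models; replacing the cheap adjusted-R^2 score with repeated double CV would multiply the cost by the number of CV repetitions, which is exactly why fit criteria or fast evaluation schemes are preferred at this stage.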

Since an exhaustive search, possibly combined with exhaustive evaluation, is practically impossible, any variable selection procedure will mostly yield suboptimal variable subsets, with the hope that they approximate the global optimum as well as possible. A strategy could be to apply different algorithms for variable selection and to save the best candidate solutions (typically 5-20 variable subsets). With this low number of potentially interesting models, it is possible to perform a detailed evaluation (like repeated double CV) in order to find one or several variables... [Pg.152]
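As a sketch of the second stage of this strategy, the code below (a hypothetical helper; it assumes the shortlist of candidate subsets has already been produced by different selection algorithms) ranks the shortlist by repeated cross-validation, a simplified stand-in for the repeated double CV mentioned above:

```python
# Sketch: rank a small shortlist of candidate variable subsets by
# repeated K-fold CV (a cheaper stand-in for repeated double CV).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

def rank_candidates(X, y, candidate_subsets, n_splits=5, n_repeats=20):
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats,
                       random_state=0)
    results = []
    for subset in candidate_subsets:
        scores = cross_val_score(LinearRegression(), X[:, list(subset)], y,
                                 cv=cv,
                                 scoring="neg_root_mean_squared_error")
        # Store mean RMSE and its spread over the repetitions.
        results.append((tuple(subset), -scores.mean(), scores.std()))
    return sorted(results, key=lambda r: r[1])  # lowest mean RMSE first
```

Because only 5-20 candidate models are evaluated, even a generous number of CV repetitions remains affordable, and the spread of the scores indicates how stable each candidate subset is.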

In many cases it is possible to use only a subset of the p variables X without a serious loss of predictive ability. One option is the forward stepwise regression (SWMLR) procedure, which in each step selects the predictor that most increases the explained variation and then checks whether a previously selected predictor can be removed (values for the F-statistics to enter and to remove variables must be fixed); another option is the best-subsets regression procedure. [Pg.709]
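A minimal sketch of such a stepwise procedure, based on the partial F-statistic with fixed F-to-enter and F-to-remove thresholds (the threshold values and all function names here are illustrative, not prescribed by the source), could look as follows:

```python
# Sketch of forward stepwise regression (SWMLR) with F-to-enter and
# F-to-remove thresholds; plain OLS via numpy least squares.
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS on the given columns plus intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

def partial_f(rss_small, rss_big, n, p_big):
    """Partial F for adding or removing one variable (1 numerator df)."""
    return (rss_small - rss_big) / (rss_big / (n - p_big - 1))

def stepwise(X, y, f_enter=4.0, f_remove=3.9):
    n, m = X.shape
    selected, changed = [], True
    while changed:
        changed = False
        # Forward step: add the unselected variable with the largest F.
        remaining = [j for j in range(m) if j not in selected]
        if remaining:
            base = rss(X, y, selected)
            f_vals = {j: partial_f(base, rss(X, y, selected + [j]),
                                   n, len(selected) + 1) for j in remaining}
            j_best = max(f_vals, key=f_vals.get)
            if f_vals[j_best] > f_enter:
                selected.append(j_best)
                changed = True
        # Backward step: drop any variable whose partial F fell below
        # the removal threshold after the last addition.
        full = rss(X, y, selected)
        for j in list(selected):
            others = [c for c in selected if c != j]
            if partial_f(rss(X, y, others), full, n, len(selected)) < f_remove:
                selected.remove(j)
                full = rss(X, y, selected)
                changed = True
    return selected
```

Keeping f_remove slightly below f_enter is the usual safeguard against a variable being added and removed in an endless cycle.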

Linear Models. Variable selection approaches can be applied in combination with both linear and nonlinear optimization algorithms. Exhaustive analysis of all possible combinations of descriptor subsets to find a specific subset of variables that affords the best correlation with the target property is practically impossible because of the combinatorial nature of this problem. Thus, stochastic sampling approaches such as genetic or evolutionary algorithms (GA or EA) or simulated annealing (SA) are employed. To illustrate one such application we shall consider the GA-PLS method, which was implemented as follows (136). [Pg.61]
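The excerpt ends before describing that implementation, so the following is only a generic stand-in, not the GA-PLS procedure of ref. 136: a minimal genetic algorithm over binary variable masks whose fitness is the cross-validated R^2 of a PLS model (population size, mutation rate, and the tournament/crossover details are all assumed choices):

```python
# Hedged sketch of GA-driven variable selection with a PLS model:
# chromosomes are 0/1 masks over descriptors, fitness is 5-fold
# cross-validated R^2 of PLS on the masked descriptor matrix.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y, n_components=2):
    if mask.sum() < n_components:          # PLS needs enough variables
        return -np.inf
    pls = PLSRegression(n_components=n_components)
    return cross_val_score(pls, X[:, mask.astype(bool)], y,
                           cv=5, scoring="r2").mean()

def ga_select(X, y, pop_size=30, n_gen=50, p_mut=0.05):
    m = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, m))
    for _ in range(n_gen):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        # Binary tournament selection of parents.
        idx = [max(rng.choice(pop_size, 2, replace=False),
                   key=lambda i: fit[i]) for _ in range(pop_size)]
        parents = pop[idx]
        # Uniform crossover with the neighbouring parent.
        cross = rng.integers(0, 2, size=parents.shape).astype(bool)
        children = np.where(cross, parents, np.roll(parents, 1, axis=0))
        # Bit-flip mutation.
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    fit = np.array([fitness(ind, X, y) for ind in pop])
    best = pop[fit.argmax()]
    return np.flatnonzero(best), fit.max()
```

Because the fitness is a full cross-validation per chromosome, the population size and generation count control the total cost; this is the usual trade-off of stochastic sampling approaches.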

Therefore, the selection of the best subset model can be performed by forward stepwise selection, starting from the variable with the lowest p-value (the current model). Next, each of the variables not yet included in the current model is added to it in turn, producing a set of candidate models with corresponding p-values. The candidate model with the lowest p-value is selected, and the process is repeated on the new current model. [Pg.472]
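A small sketch of this p-value driven forward selection (using statsmodels OLS; the stopping threshold alpha is an assumed convention, not specified in the excerpt):

```python
# Sketch: forward selection where, at each step, the candidate whose
# coefficient has the lowest p-value joins the current model.
import numpy as np
import statsmodels.api as sm

def forward_by_pvalue(X, y, alpha=0.05):
    m = X.shape[1]
    selected = []
    while True:
        remaining = [j for j in range(m) if j not in selected]
        if not remaining:
            break
        pvals = {}
        for j in remaining:
            A = sm.add_constant(X[:, selected + [j]])
            fit = sm.OLS(y, A).fit()
            pvals[j] = fit.pvalues[-1]   # p-value of the newly added term
        j_best = min(pvals, key=pvals.get)
        if pvals[j_best] >= alpha:       # no candidate is significant
            break
        selected.append(j_best)
    return selected
```

Note that p-values obtained during stepwise selection are optimistically biased, which is another reason to confirm the final subset with an independent evaluation such as cross-validation.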

A ranking model is a relationship between a set of dependent attributes, experimentally investigated, and a set of independent attributes, i.e., model variables. As in regression and classification models, variable selection is one of the main steps in finding predictive models. In the present work, the Genetic Algorithm (GA-VSS) approach is proposed as the variable selection method to search for the best ranking models within a wide set of predictor variables. The ranking based on the selected subsets of variables is... [Pg.181]

In the first case, each variable not selected means a reduction in cost and/or analysis time. Variable selection should therefore always be made on a cost/benefit basis, looking for the subset of variables leading to the best compromise between model performance and the cost of the analyses. This means that, in the presence of groups of useful but highly correlated (and therefore redundant) variables, only one variable per group should be retained. With such data sets, it is also possible that a subset of variables giving a slightly worse result is preferred, if the reduction in performance is amply compensated by a reduction in cost or time. [Pg.237]
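One simple way to implement the one-variable-per-correlated-group rule on a cost basis is sketched below (the per-variable cost vector and the correlation threshold are assumed inputs; the names are illustrative):

```python
# Sketch: within groups of highly correlated variables, keep only the
# cheapest member; everything above the threshold counts as redundant.
import numpy as np

def prune_correlated(X, costs, threshold=0.95):
    corr = np.abs(np.corrcoef(X, rowvar=False))   # m x m correlations
    keep = []
    # Visit cheap variables first, so each correlated group ends up
    # represented by its lowest-cost member.
    for j in np.argsort(costs):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return sorted(keep)
```

The retained subset can then be passed to any of the selection procedures above, with the cost/benefit trade-off made explicit by how the costs vector is defined.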

We are particularly interested in finding, for a given k, the best subset of k variables and its predicting function (best subset selection, BSS). The trivial solution of this problem is to search all k-subsets, determine a predicting function for each, and then select the best of these. However, this requires a high computational effort. There are linear algebra techniques that can be used to minimize the effort in the case of linear regression [82]. Nevertheless, it is often impossible to search all subsets in reasonable time. [Pg.230]

The conclusion that might be drawn is that the response variables are adequately described without information from the lead variables, as the lead variables were not selected into the best subset of regressors in either year 1 or year 2. The fact that the selection order of the best regressors was identical for the three response variables in each year indicates that the explanatory variables have equivalent effects on both mental and psychomotor functioning. The decrease in the effectiveness of birth weight over the first 2 years was coupled with an increasing importance of home environment, effects that have been suggested before in the developmental literature. [Pg.376]

It is clear from the review of the data from the first 2 years of this study that the response variables are adequately described without recourse to information from the lead variables, since these were not selected from the eight possible explanatory variables for the best subset of regressors in either year 1 or year 2. The best regressors were, as expected, selected from home environment, socioeconomic factors, and birth weight. This may have been the result of the sharp fall in water supply lead exposure, which took place shortly after these children were born. This would suggest that prenatal and early postnatal exposure to lead is less critical than continuing exposure over a period of years into early childhood. [Pg.377]

