
Regression variable selection

To benchmark our learning methodology against alternative conventional approaches, we used the same 500 (x, y) data records and followed the usual regression analysis steps (including stepwise variable selection, examination of residuals, and variable transformations) to find an approximate empirical model, f(x), with a coefficient of determination R² = 0.79. This model is given by... [Pg.127]

Closely related to the creation of regression models by OLS is the problem of variable selection (feature selection). This topic is therefore presented in Section 4.5, although variable selection is also highly relevant for other regression methods and for classification. [Pg.119]

The following three performance measures are commonly used for variable selection by stepwise regression or by best-subset regression. An example in Section 4.5.8 describes the use and comparison of these measures. [Pg.129]

An often-used version of stepwise variable selection (stepwise regression) works as follows: Select the variable with the highest absolute correlation coefficient with the y-variable; the number of selected variables is m0 = 1. Add each of the remaining x-variables separately to the selected variable; the number of variables in each subset is m1 = 2. Calculate F as given in Equation 4.44,... [Pg.154]
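A minimal sketch of this forward procedure, assuming a numeric data matrix X and response y (both hypothetical names) and a partial-F statistic for each candidate addition; the entry threshold f_to_enter is an assumed parameter, not taken from the cited text:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Forward selection: start with the variable most correlated with y,
    then add variables while the partial F-statistic of the best
    candidate exceeds f_to_enter (threshold is an assumption)."""
    n, p = X.shape
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]
    selected = [int(np.argmax(corr))]            # m0 = 1
    while len(selected) < p:
        best_j, best_f = None, -np.inf
        rss_small = rss(X[:, selected], y)
        for j in set(range(p)) - set(selected):
            cand = selected + [j]
            rss_big = rss(X[:, cand], y)
            df = n - len(cand) - 1               # residual df of larger model
            f = (rss_small - rss_big) / (rss_big / df)
            if f > best_f:
                best_j, best_f = j, f
        if best_f < f_to_enter:
            break
        selected.append(best_j)
    return selected
```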

A simple strategy for variable selection is based on information from other multivariate methods such as PCA (Chapter 3) or PLS regression (Section 4.7). These methods form new latent variables as linear combinations of the regressor... [Pg.157]
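One way to act on this idea, sketched below under the assumption that variables with large absolute loadings on the dominant principal components carry most of the information; the cutoff n_keep is an assumed parameter:

```python
import numpy as np
from sklearn.decomposition import PCA

def select_by_loadings(X, n_components=3, n_keep=10):
    """Rank variables by their largest absolute loading on the first
    few principal components and keep the top n_keep of them."""
    pca = PCA(n_components=n_components).fit(X)
    # components_ has shape (n_components, n_variables)
    importance = np.abs(pca.components_).max(axis=0)
    return np.argsort(importance)[::-1][:n_keep]
```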

In Section 4.8.2, we will describe a method called Lasso regression. Depending on a tuning parameter, this regression technique forces some of the regression coefficients to be exactly zero. The method can thus be viewed as a variable selection method in which all variables with nonzero coefficients enter the final regression model. [Pg.157]
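A brief illustration of this effect on simulated data (all variable names are hypothetical); the strength of the penalty, and hence the number of zeroed coefficients, is set by the tuning parameter alpha:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))            # hypothetical data
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)

# Larger alpha drives more coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)        # indices of retained variables
print("variables in the final model:", selected)
```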

For each chromosome (variable subset), a so-called fitness (response, objective function) has to be determined, which in the case of variable selection is a performance measure of the model created from this variable subset. In most GA applications, only fit criteria that consider the number of variables (AIC, BIC, adjusted R², etc.) are used, together with fast OLS regression and fast leave-one-out CV (see Section 4.3.2). Only rarely are more powerful evaluation schemes applied (Leardi 1994). [Pg.157]
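A minimal sketch of such a GA, assuming adjusted R² from an OLS fit as the fitness; the population size, mutation rate, and number of generations are assumed parameters, and the selection/crossover scheme is one simple choice among many:

```python
import numpy as np

def adjusted_r2(X, y, mask):
    """Fitness of a chromosome: adjusted R2 of an OLS fit
    using only the variables where mask is True."""
    k = int(mask.sum())
    if k == 0:
        return -np.inf
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    ss_res, ss_tot = r @ r, ((y - y.mean()) ** 2).sum()
    n = len(y)
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

def ga_select(X, y, pop=30, gens=50, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]
    chrom = rng.random((pop, n_var)) < 0.5       # random initial population
    for _ in range(gens):
        fit = np.array([adjusted_r2(X, y, c) for c in chrom])
        order = np.argsort(fit)[::-1]
        parents = chrom[order[: pop // 2]]       # keep the fitter half
        kids = parents.copy()
        cut = rng.integers(1, n_var, size=len(kids))
        for i, c in enumerate(cut):              # one-point crossover
            mate = parents[rng.integers(len(parents))]
            kids[i, c:] = mate[c:]
        kids ^= rng.random(kids.shape) < p_mut   # bit-flip mutation
        chrom = np.vstack([parents, kids])
    fit = np.array([adjusted_r2(X, y, c) for c in chrom])
    return chrom[int(np.argmax(fit))]            # best variable subset (mask)
```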

The crucial point for building a prediction model with PCR (Section 4.6) is to determine the number of PCs to be used for prediction. In principle we could perform variable selection on the PCs, but for simplicity we limit ourselves to finding the number of PCs with the largest variances that yields the best prediction. In other words, the PCs are sorted in decreasing order of their variance, and the prediction error of a regression model with the first a components tells us which number of components is optimal. As discussed in Section 4.2... [Pg.187]
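This selection can be sketched with cross-validated mean squared error over the first a components; the range of a tried and the 5-fold CV scheme below are assumptions, not taken from the cited text:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 15))               # hypothetical data
y = X @ rng.standard_normal(15) + 0.2 * rng.standard_normal(80)

errors = []
for a in range(1, 11):                          # try a = 1..10 components
    pcr = make_pipeline(PCA(n_components=a), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    errors.append(mse)
best_a = int(np.argmin(errors)) + 1
print("optimal number of PCs:", best_a)
```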

An exhaustive search for an optimal variable subset is impossible for this data set because the number of variables is too high. Even an algorithm like leaps-and-bounds cannot be applied (Section 4.5.4). Instead, variable selection can be based on a stepwise procedure (Section 4.5.3). Since it is impossible to start with the full model, we start with the empty model (regressing the y-variable on a constant), with the scope... [Pg.196]

A further consideration in choosing an appropriate method is whether the final model needs to be interpreted. In this case, and especially in the case of many original regressor variables, interpretation is generally easier if only a few regressor variables are used in the final model. This can be achieved by variable selection (Section 4.5), Lasso regression (Section 4.8.2), or regression trees (Section 4.8.3.3). PCR (Section 4.6) or PLS (Section 4.7) can also lead to interpretable components if these summarize the information of thematically related x-variables. [Pg.203]

Broadhurst, D., Goodacre, R., Jones, A., Rowland, J. J., Kell, D. B., Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry. Anal. Chim. Acta 1997, 348, 71-86. [Pg.204]

R. Heikka, P. Minkkinen, and V.-M. Taavitsainen, Comparison of variable selection and regression methods in multivariate calibration of a process analyzer. Process Control and Quality, 6, 47-54 (1994). [Pg.435]

Multiple linear regression with variable selection makes the matrix inversion possible by selecting a subset of the original variables. Both PCR and PLS reduce the number of variables by calculating linear combinations (factors) of the original variables and using a small enough number of these factors to allow the matrix inversion. [Pg.130]

As discussed in the introduction, the solution of the inverse model equation for the regression vector involves the inversion of RᵀR (see Equation 5.23). In many analytical chemistry experiments, a large number of variables are measured and RᵀR cannot be inverted (i.e., it is singular). One approach to solving this problem is called stepwise MLR, where a subset of variables is selected such that RᵀR is not singular. There must be at least as many variables selected as there are chemical components in the system, and these variables must represent different sources of variation. Additional variables are required if there are other sources of variation (chemical or physical) that need to be modeled. It may also be the case that a sufficiently small number of variables are measured so that MLR can be used without variable selection. [Pg.130]
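The rank problem is easy to demonstrate. In the sketch below, a hypothetical response matrix R with more variables than samples makes RᵀR singular; selecting a small enough subset of columns (illustrative indices, standing in for a stepwise-selected subset) restores invertibility:

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.standard_normal((10, 50))       # 10 samples, 50 variables

full = R.T @ R                          # 50 x 50, but rank at most 10
print(np.linalg.matrix_rank(full))      # 10 -> singular, not invertible

subset = R[:, [3, 17, 29, 41]]          # a selected 4-variable subset
small = subset.T @ subset               # 4 x 4, full rank
inv_small = np.linalg.inv(small)        # now the inversion succeeds
print(small.shape, np.linalg.matrix_rank(small))
```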

Model and Parameter Statistics (Model Diagnostics). Table 5-13 displays the variables selected for a model constructed to predict caustic. The table lists summary statistics for the regression model as well as information about the estimated regression coefficients. Six variables in addition to an intercept are found to be significant at the 95% confidence level. [Pg.140]

Some of the earliest applications of chemometrics in PAC involved the use of an empirical variable selection technique commonly known as stepwise multiple linear regression (SMLR).8,26,27 As the name suggests, this is a technique in which the relevant variables are selected sequentially. This method works as follows ... [Pg.243]

One particular challenge in the effective use of MLR is the selection of appropriate X-variables to use in the model. The stepwise and APC methods are among the most common empirical methods for variable selection. Prior knowledge of process chemistry and dynamics, as well as of the process analytical measurement technology itself, can be used to enable a priori selection of variables or to provide some added confidence in variables that are selected empirically. If a priori selection is done, one must be careful to select variables that are not highly correlated with one another, or else the matrix inversion done to calculate the MLR regression coefficients (Equation 8.24) can become unstable and introduce noise into the model. [Pg.255]
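One quick way to screen for this, sketched under the assumption that X holds the candidate variables as columns: inspect the pairwise correlations and the condition number of XᵀX before fitting (the 0.95 cutoff is an assumed threshold, not from the cited text):

```python
import numpy as np

def collinearity_report(X, corr_cutoff=0.95):
    """Flag highly correlated variable pairs and report how
    ill-conditioned the MLR normal equations would be."""
    C = np.corrcoef(X, rowvar=False)     # variables x variables
    flagged = [(i, j) for i in range(C.shape[0])
               for j in range(i + 1, C.shape[0])
               if abs(C[i, j]) > corr_cutoff]
    cond = np.linalg.cond(X.T @ X)       # large -> unstable inversion
    return flagged, cond
```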

As for sample selection, I will present two different methods for variable selection: one that is relatively simple and one that is more computationally intensive. The simpler method68 involves a series of linear regressions of each X-variable against the property of interest. The relevance of an X-variable is then expressed by the ratio of the linear regression slope (b) to the variance of the elements in the linear regression model residual... [Pg.314]
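A small sketch of this univariate screen, under the assumption that y is regressed on each single X-column and that relevance is scored as slope over residual variance; the exact normalization in the cited method may differ:

```python
import numpy as np

def univariate_relevance(X, y):
    """Score each X-variable by |slope| / residual variance
    from a one-variable linear regression against y."""
    scores = []
    for j in range(X.shape[1]):
        x = X[:, j]
        b, a = np.polyfit(x, y, 1)       # slope b, intercept a
        resid = y - (b * x + a)
        scores.append(abs(b) / resid.var())
    return np.array(scores)              # larger -> more relevant
```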

The advantage of the GA variable selection approach over the univariate approach discussed earlier is that it is a true search for an optimal multivariate regression solution. One disadvantage of the GA method is that one must enter several parameters before it... [Pg.315]

Heikka, R., Minkkinen, P. and Taavitsainen, V.-M., Comparison of Variable Selection and Regression Methods in Multivariate Calibration of a Process Analyzer. Process Contr. Qual. 1994, 6, 47-54. [Pg.325]

