
Supervised variable selection

With the exception of variables having zero variance (which eliminate themselves), the decision about which variables to eliminate or include, and the method by which this is done, depends on several factors. The two most important are whether the dataset consists of two blocks of variables, a response block (Y) and a descriptor/predictor block (X), and whether the purpose of the analysis is to predict or describe values of one or more response variables from a model relating the variables in the two blocks. If that is indeed the aim of the analysis, then it seems reasonable that the choice of variables to include should depend, to some extent, on the response variable or variables being modeled. This approach is referred to as supervised variable selection. If, on the other hand, the dataset consists of only one block of variables, the choice of variables must be made by what is referred to as unsupervised variable selection. [Pg.307]
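As a minimal illustration of the distinction (a sketch in NumPy with synthetic data; the correlation threshold of 0.3 is an arbitrary placeholder, not a value from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))             # descriptor block (50 samples, 20 variables)
y = X[:, 3] + 0.5 * rng.normal(size=50)   # response block (one response variable)

# Unsupervised selection: uses only X, e.g. drop near-zero-variance columns.
keep_unsup = X.var(axis=0) > 1e-8

# Supervised selection: uses y, e.g. keep variables correlated with the response.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep_sup = np.abs(r) > 0.3                # illustrative threshold

print(keep_unsup.sum(), "variables kept unsupervised,",
      keep_sup.sum(), "kept supervised")
```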

The number of papers in the literature on supervised variable selection methods is vast. Its close association with the ubiquitous practice of regression modeling means that an exhaustive overview of the literature is an impossible task. Although this chapter aims to review variable selection methods, all that can realistically be achieved is a broad coverage of the more obvious source methods, reporting techniques that are sound from a statistical point of view and for which software is available... [Pg.309]

This procedure can be performed either with item 1, in which case it is considered supervised variable selection, since the response variable has guided the choice of variables, or without item 1, relegating it to the unsupervised selection category. [Pg.334]

Adopting the unsupervised option initially, the first two variables selected are those with the lowest pairwise correlation. The next variable selected is the one with the smallest squared multiple correlation with the first two. This process continues until a preset maximum level of multicollinearity (measured by the squared multiple correlation coefficient) is reached. Whitley et al. refer to this procedure as unsupervised forward selection (UFS). UFS can also be run with a minimum-variance criterion, so that only variables whose variance exceeds the minimum are selected; the two criteria can be applied simultaneously. With supervised variable selection, only those variables having a sufficiently high correlation with the response are considered, and what is effectively UFS is then run on this reduced set of variables. We term this latter process supervised forward selection (SFS). To see how these options work, and to examine their effect on the model produced, we performed PLS on the data with both UFS and SFS configured to run over a range of response-variable correlations (Table 8). [Pg.335]
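The description above translates fairly directly into code. The following is a minimal sketch of UFS and SFS as just described, assuming a descriptor matrix with no zero-variance columns; the thresholds r_min and r2_max are illustrative placeholders rather than values from the text:

```python
import numpy as np

def ufs(X, r2_max=0.99):
    """Unsupervised forward selection (after Whitley et al.), minimal sketch.
    Variables are added while their squared multiple correlation with the
    already-selected set stays below r2_max."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Start with the two variables having the lowest absolute pairwise correlation.
    C = np.abs(np.corrcoef(Xc, rowvar=False))
    np.fill_diagonal(C, np.inf)
    i, j = np.unravel_index(np.argmin(C), C.shape)
    selected = [i, j]
    remaining = [k for k in range(p) if k not in selected]
    while remaining:
        best, best_r2 = None, np.inf
        for k in remaining:
            # Squared multiple correlation of candidate k with the selected set,
            # via least-squares regression of column k on the selected columns.
            S = Xc[:, selected]
            beta, *_ = np.linalg.lstsq(S, Xc[:, k], rcond=None)
            resid = Xc[:, k] - S @ beta
            r2 = 1 - resid.var() / Xc[:, k].var()
            if r2 < best_r2:
                best, best_r2 = k, r2
        if best_r2 >= r2_max:   # multicollinearity ceiling reached
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def sfs(X, y, r_min=0.3, r2_max=0.99):
    """Supervised forward selection: UFS restricted to variables whose absolute
    correlation with the response exceeds r_min (assumes >= 2 pass the screen)."""
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    cand = np.where(r > r_min)[0]
    return [cand[i] for i in ufs(X[:, cand], r2_max=r2_max)]
```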

A set of molecules is commonly described by anywhere from 4 to 10,000 descriptors, and sparse representations can run to 2 million descriptors. Variable selection, also called descriptor subset selection or descriptor validation, is important whether the context is supervised or unsupervised learning (Section 6). [Pg.79]

Supervised variable elimination might also be regarded as variable selection; whether we consider it a third major approach to treating multivariate datasets is a matter of semantics. It is possible to eliminate variables in a supervised manner rather than to select them. One obvious way is to eliminate variables that have zero or very low correlation with the response variable or variables. In the case of classified response data, this means eliminating those descriptors that have the same distribution (mean and standard deviation) across the two or more classes. The danger in this process is that a variable with a low individual correlation with the response might still contribute to a multivariate correlation. Although this is possible, in practice it is unlikely. [Pg.309]
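A sketch of both elimination rules mentioned here, assuming a NumPy descriptor matrix; the correlation bound and significance level are arbitrary illustrations, and the two-sample t-test is just one simple way to compare class distributions:

```python
import numpy as np
from scipy import stats

def eliminate_continuous(X, y, r_min=0.1):
    """Drop descriptors whose |correlation| with a continuous response is below r_min."""
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return X[:, r >= r_min]

def eliminate_classified(X, labels, alpha=0.05):
    """For a two-class response, drop descriptors whose distributions (compared
    here by a two-sample t-test on the class means) do not differ between classes."""
    a, b = X[labels == 0], X[labels == 1]
    _, pvals = stats.ttest_ind(a, b, axis=0)
    return X[:, pvals < alpha]
```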

LDA models may also be used to identify important variables, but here it should be remembered that discriminant functions are not unique solutions; the use of LDA for variable selection may therefore be misleading. Whatever form of supervised learning method is used for the identification... [Pg.159]

The selection of variables can separate relevant information from unwanted variability and at the same time allow data compression, that is, more parsimonious models, simplification or improvement of model interpretation, and so on. Although many approaches can be used for feature selection, in this work a wavelet-based supervised feature selection/classification algorithm, WPTER [12], was applied. The best-performing model was obtained using a Daubechies 10 wavelet, a maximum decomposition level of 10, the between-class/within-class variance ratio criterion for the thresholding operation, and a percentage of selected coefficients equal to 2%. Six wavelet coefficients were selected, belonging to the 4th, 5th, 6th, 8th, and 9th levels of decomposition. [Pg.401]
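The following sketch conveys the idea of selecting wavelet coefficients by a between-class/within-class variance ratio. It is not the WPTER algorithm [12] itself: for simplicity it uses the plain multilevel DWT (pywt.wavedec) rather than the full wavelet packet tree, and the level and fraction are illustrative defaults:

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_feature_select(X, labels, wavelet="db10", level=5, frac=0.02):
    """Decompose each signal, score every coefficient by its between-class/
    within-class variance ratio, and keep the top `frac` of coefficients."""
    # One flattened coefficient vector per sample.
    coeffs = np.array([np.concatenate(pywt.wavedec(x, wavelet, level=level))
                       for x in X])
    classes = np.unique(labels)
    grand = coeffs.mean(axis=0)
    between = sum((labels == c).sum()
                  * (coeffs[labels == c].mean(axis=0) - grand) ** 2
                  for c in classes)
    within = sum(((coeffs[labels == c]
                   - coeffs[labels == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes)
    ratio = between / (within + 1e-12)
    n_keep = max(1, int(frac * coeffs.shape[1]))
    keep = np.argsort(ratio)[-n_keep:]
    return coeffs[:, keep], keep
```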

Most supervised pattern recognition procedures permit stepwise selection, i.e. selecting first the most important feature, then the second most important, and so on. One way to do this is by prediction using, e.g., cross-validation (see next section): we first select the variable that best classifies objects of known classification that are not part of the training set, then the variable that most improves the classification already obtained with the first selected variable, and so on. For the linear discriminant analysis of the EU/HYPER classification of Section 33.2.1, the results are that with all 5 variables, or with 4, a selectivity of 91.4% is obtained, and with 3 or 2 variables 88.6% [2], as the measure of classification success. Selectivity is used here in the sense of Chapter... [Pg.236]
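A modern equivalent of this stepwise, cross-validated selection can be sketched with scikit-learn's SequentialFeatureSelector wrapped around LDA (an illustrative substitute, not the software used in [2]; X and y are assumed to be an existing descriptor matrix and class labels):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

# Forward stepwise selection scored by cross-validated classification accuracy:
# at each step, the variable that most improves CV performance is added.
lda = LinearDiscriminantAnalysis()
selector = SequentialFeatureSelector(lda, n_features_to_select=3,
                                     direction="forward", cv=5)
selector.fit(X, y)             # X: descriptors, y: class labels (assumed defined)
print(selector.get_support())  # boolean mask of the selected variables
```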

For the example in Fig. 2, the Fourier-transformed NMR spectra (the variables or descriptors being intensity as a function of frequency) were used to create the data matrix D. Note that many different descriptors can be used to create D, the choice depending on the analysis method and the information to be extracted. For example, in the spectral resolution methods (Section 6), the desired end result is the determination of the true or pure component spectra and the relative concentrations present within the samples or mixtures [Eq. (4)]; in this case the unmodified real spectra I_j(ω) are commonly used for the chemometric analysis. In contrast, for the non-supervised and supervised methods described in Sections 3 and 4, the desired outcome is the classification of a sample into different categories. For these types of non-supervised and supervised methods the original NMR spectrum can be manipulated or transformed to produce new descriptors including... [Pg.46]
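As a minimal sketch of how such a data matrix D might be assembled (the list fids of time-domain signals is a hypothetical placeholder, and magnitude spectra are used for simplicity; a real workflow would phase and baseline-correct the spectra first):

```python
import numpy as np

# Each row of D is one sample's descriptor vector; for the NMR example the
# descriptors are spectral intensities on a common frequency axis.
spectra = [np.abs(np.fft.fft(fid)) for fid in fids]  # fids: assumed list of FIDs
D = np.vstack(spectra)  # shape: (n_samples, n_frequencies)
```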

The performance of PC-LDA is very much dependent on the number of components taken into account. This behavior can be explained by noting that in PC-LDA the variable reduction step is performed without any knowledge of class labels, selecting only the directions of greatest variance. If these directions show little discriminating power, their supervised linear combination leads to poor modeling. However, performance improves with... [Pg.151]
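A short sketch of this dependence, scanning the number of retained components in a PCA-then-LDA pipeline with scikit-learn (X and y are assumed to be an existing descriptor matrix and class labels):

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# PC-LDA: unsupervised PCA reduction followed by LDA. Scanning n_components
# shows how performance depends on how many variance-ranked PCs are kept.
for n in (2, 5, 10, 20):
    model = make_pipeline(PCA(n_components=n), LinearDiscriminantAnalysis())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(n, "components:", round(score, 3))
```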

In all that we have said so far, the choice of the set of p variables to be included in the regression has been made independently of the response variable. If this is not the case, then another type of bias is introduced, referred to as competition or selection bias. It occurs when the variables to be included in the model are chosen in a supervised manner to maximize some function involving the response variable. One such function is R², the proportion of the total sum of squares of Y explained by the regression. To understand this form of bias, consider a simple situation in which several regressions are constructed, each containing only one descriptor variable. The data and regressions were constructed as follows ... [Pg.320]
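The inflation is easy to demonstrate numerically: with enough candidate descriptors, the best single-variable R² (which for one descriptor equals the squared correlation) is large even when X and Y are unrelated. A minimal sketch with pure noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 200
X = rng.normal(size=(n, p))   # pure-noise descriptors
y = rng.normal(size=n)        # response unrelated to X

# Fit one-descriptor regressions and keep the best R^2: the maximum over a
# large pool is substantial even though every true correlation is zero --
# this inflation is the selection (competition) bias.
r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(p)])
print("best single-variable R^2 on pure noise:", r2.max().round(2))
```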

Select variables that have significant correlations with the response variable above some lower bound (supervised selection)... [Pg.334]

This chapter has described some of the more commonly used supervised learning methods for the analysis of data: discriminant analysis and its relatives for classified dependent data, and variants of regression analysis for continuous dependent data. Supervised methods have the advantage that they produce predictions, but the disadvantage that they can suffer from chance effects. Careful selection of variables and of test/training sets, the use of more than one technique where possible, and the application of common sense will all help to ensure that the results obtained from supervised learning are useful. [Pg.160]





