
Multivariate data feature selection

The essential goal of handling multivariate data is to reduce the number of dimensions. This is achieved not by selecting the most suitable pair of features, but by computing new coordinates through an appropriate transformation of the feature space. In most cases the new variables Z are determined by linear combination of the... [Pg.46]
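As a minimal sketch of such a transformation (the notation here is an assumption, not taken from the excerpt): each new variable z_j is a weighted sum of the p original features,

    z_j = b_j1*x_1 + b_j2*x_2 + ... + b_jp*x_p,    j = 1, ..., q  (q < p),

where the weights b_ji are fixed by the chosen method; in principal component analysis, for example, they are chosen so that the scores have maximal variance and are mutually uncorrelated.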

Fig. 3.2 Typical applications using chemical multivariate data (schematically shown for 2-dimensional data): cluster analysis (a), separation of categories (b), discrimination by a decision plane and classification of unknowns (c), modelling categories and principal component analysis (d), feature selection (X2 is not relevant for category separation) (e), relationship between a continuous property Y and the features X1 and X2 (f).
Self-organizing maps in conjunction with principal component analysis constitute a powerful approach for the display and classification of multivariate data. This does not, however, make feature selection superfluous. Deleting irrelevant features can improve the reliability of the classification, because noisy variables increase the chance of false classification and decrease success rates on new data. Furthermore, feature selection can lead to an understanding of the essential features governing the behavior of the system or process under investigation: it identifies which measurements are informative and which are not. Any approach used for feature selection should, however, account for redundancies in the data and be multivariate in nature to ensure that all relevant features are identified. [Pg.371]
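A minimal sketch of such a multivariate selection in Python (the function name, thresholds, and the use of random-forest importances are illustrative assumptions, not the procedure described above): redundant features are removed first via pairwise correlation, and the remaining ones are then ranked with a multivariate criterion so that joint effects are taken into account.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def select_features(X, y, corr_limit=0.95, n_keep=10):
        # Illustrative multivariate feature selection (assumed names/thresholds).
        X = np.asarray(X, dtype=float)
        # 1) handle redundancy: keep only the first of each pair of
        #    highly correlated features (|r| > corr_limit)
        r = np.corrcoef(X, rowvar=False)
        keep = []
        for j in range(X.shape[1]):
            if all(abs(r[j, k]) <= corr_limit for k in keep):
                keep.append(j)
        # 2) rank the remaining features with a multivariate model
        rf = RandomForestClassifier(n_estimators=200, random_state=0)
        rf.fit(X[:, keep], y)
        order = np.argsort(rf.feature_importances_)[::-1]
        return [keep[i] for i in order[:n_keep]]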

Lasch and coworkers describe in Chap. 8 their group's efforts to improve taxonomic resolution without compromising the simplicity and speed of MALDI-TOF MS. Such improvements may be achieved by expanding the signature database with novel and diverse strains and by optimizing and standardizing sample-preparation and data-acquisition protocols. Further enhancements of the data-analysis pipeline, including more advanced spectral preprocessing, feature selection, and supervised multivariate classification methods, also contribute to improved taxonomic resolution. Strains of Staphylococcus aureus, Enterococcus faecium, and Bacillus cereus are selected to illustrate aspects of that strategy. [Pg.5]

Type 1 multivariate data contain only x-data. The typical aims of data interpretation are gaining insight into the data structure, searching for clusters of similar objects, selecting relevant features, and detecting outliers. This type of data evaluation is often referred to as exploratory data analysis or unsupervised learning. [Pg.348]
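As a brief sketch of such unsupervised exploration (function and parameter names are assumptions): the x-data are autoscaled, grouped by cluster analysis, and objects far from their cluster centre are flagged as potential outliers.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    def explore(X, n_clusters=3, z_limit=3.0):
        Xs = StandardScaler().fit_transform(X)   # autoscale the features
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(Xs)
        # distance of each object to the centre of its own cluster
        d = np.linalg.norm(Xs - km.cluster_centers_[km.labels_], axis=1)
        outliers = d > d.mean() + z_limit * d.std()   # crude outlier rule
        return km.labels_, outliers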

PCA is by far the most important method in multivariate data analysis and has two main applications: (a) visualization of multivariate data by scatter plots, as described above; (b) data reduction and transformation, especially if features are highly correlated or noise has to be removed. For this purpose, a subset of uncorrelated principal component scores U can be used instead of the original p variables X. The number of principal components considered is often determined by applying a threshold to the score variance. For instance, only principal components with a variance greater than 1% of the total variance may be selected, while the others are treated as noise. The number of principal components with non-negligible variance is a measure of the intrinsic dimensionality of the data. As an example, consider a data set with three features. If all object points lie exactly on a plane, the intrinsic dimensionality is two; the third principal component has a variance of zero, so two variables (the scores of PC1 and PC2) suffice for a complete description of the data structure. [Pg.352]
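The plane example can be reproduced in a few lines (a sketch with simulated data; the 1% threshold follows the description above, everything else is an assumption):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    t = rng.normal(size=(100, 2))          # two latent coordinates
    x3 = 0.5 * t[:, 0] - 0.2 * t[:, 1]     # third feature lies on the plane
    X = np.column_stack([t[:, 0], t[:, 1], x3])

    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)   # third value is (numerically) zero
    # keep only components above the variance threshold (1 % of total)
    n_keep = int(np.sum(pca.explained_variance_ratio_ > 0.01))   # -> 2
    scores = PCA(n_components=n_keep).fit_transform(X)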

The pool of descriptors that is calculated must be winnowed down to a manageable set before a statistical or neural network model is constructed. This operation is called feature selection. The first step of feature selection is to apply a battery of objective statistical methods. Descriptors that contain little information, that vary little across the data set, or that are highly correlated with other descriptors are candidates for elimination. Multivariate correlations among descriptors can also be discovered with multiple linear regression analysis, and these relationships can be broken up by eliminating descriptors. [Pg.2325]
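A minimal sketch of this objective first pass (the function name and thresholds are illustrative assumptions): near-constant descriptors are dropped first, then one of each highly correlated pair.

    import numpy as np

    def winnow(D, var_limit=1e-4, corr_limit=0.9):
        D = np.asarray(D, dtype=float)
        # 1) drop descriptors with little variation across the data set
        cols = np.where(D.var(axis=0) > var_limit)[0]
        # 2) drop one of each pair of highly correlated descriptors
        r = np.corrcoef(D[:, cols], rowvar=False)
        keep = []
        for j in range(len(cols)):
            if all(abs(r[j, k]) <= corr_limit for k in keep):
                keep.append(j)
        return cols[keep]   # indices of the retained descriptors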

Sets of spectroscopic data (IR, MS, NMR, UV-Vis) or other data are often subjected to one of the multivariate methods discussed in this book. One issue in this type of calculation is reducing the number of variables by selecting a subset to be included in the data analysis. The view is gaining support that selecting variables prior to the data analysis improves the results. For instance, variables that are weakly correlated or uncorrelated with the property to be modeled are disregarded. Another approach is to compress all variables into a few features, e.g. by a principal components analysis (see Section 31.1). This is called... [Pg.550]
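Both routes can be sketched as follows (names and the 0.3 cut-off are assumptions, not values from the text): (a) discard variables weakly correlated with the property y, or (b) compress all variables into a few principal component scores.

    import numpy as np
    from sklearn.decomposition import PCA

    def reduce_variables(X, y, r_min=0.3, n_components=5):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # (a) keep variables sufficiently correlated with the property
        r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        selected = np.where(np.abs(r) >= r_min)[0]
        # (b) alternatively, compress all variables into a few scores
        scores = PCA(n_components=n_components).fit_transform(X)
        return selected, scores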

Interestingly, there is no literature on the effect of sample size on biomarker identification in the "omics" sciences, and the objective of this contribution is to fill this gap. We focus on a two-class problem, and in particular on small data sets. In our approach, real class differences have been introduced by spiking apple extracts with selected compounds, analyzing them using UPLC-TOF mass spectrometry, and comparing the feature lists to those of unspiked apple extracts. Using these data we are able to run a comparison between two multivariate... [Pg.142]

