Reducing the Dimensionality of a Data Set

The dimensionality of a data set is the number of variables that are used to describe each object. For example, a conformation of a cyclohexane ring might be described in terms of the six torsion angles in the ring. However, it is often found that there are significant correlations between these variables. Under such circumstances, a cluster analysis is often facilitated by reducing the dimensionality of a data set to eliminate these correlations. Principal components analysis (PCA) is a commonly used method for reducing the dimensionality of a data set. [Pg.497]

The principal components are calculated using standard matrix techniques [Chatfield and Collins 1980]. The first step is to calculate the variance-covariance matrix. If there are s observations, each of which contains v values, then the data set can be represented as a matrix D with v rows and s columns. The variance-covariance matrix Z is [Pg.498]
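A minimal Python sketch of these first steps, assuming NumPy and synthetic data. The excerpt's formula for Z is cut off, so the normalisation by s - 1 used here is an assumption (chosen to match `numpy.cov`), and the function name `principal_components` is purely illustrative.

```python
import numpy as np

def principal_components(D):
    """D has v rows (variables) and s columns (observations), as in the text."""
    D = np.asarray(D, dtype=float)
    v, s = D.shape
    # Centre each variable (row) on its mean over the s observations.
    centered = D - D.mean(axis=1, keepdims=True)
    # Assumed normalisation: divide by s - 1 (sample covariance), giving a v x v matrix.
    Z = centered @ centered.T / (s - 1)
    # Z is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Z)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return Z, eigvals[order], eigvecs[:, order]

# Synthetic example: 6 variables, 50 observations, two of them strongly correlated.
rng = np.random.default_rng(0)
D = rng.normal(size=(6, 50))
D[1] = D[0] + 0.1 * rng.normal(size=50)
Z, eigvals, eigvecs = principal_components(D)
```

Each column of `eigvecs` holds the coefficients of one principal component, and the matching entry of `eigvals` is the variance along it.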

As an example of the application of principal components analysis, we shall consider the conformations adopted by the five-membered ribose ring in our set of conformations [Pg.498]

Fig. 9.33. Scatterplot of the first two principal components for the ring torsion angles T3-T7 [Pg.499]

The central idea of PCA is to reduce the dimensionality of a data set that may consist of a large number of interrelated variables while retaining as much as possible of the variation present in the data set [317-320]. [Pg.268]

Principal components (PCs) are linear combinations of random or statistical variables, which have special properties in terms of variances. The central idea of PCA is to reduce the dimensionality of a data set that may consist of a large number of interrelated variables while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the PCs, which are uncorrelated and which are ordered so that the first few retain most of the variation present in all of the original variables [292-295]. [Pg.357]
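The two properties named here, uncorrelated components and ordering by retained variation, can be checked numerically. The sketch below assumes NumPy, synthetic data with rows as observations, and a hypothetical helper `pca_scores`. It uses the singular value decomposition of the centred data, which is one common way to obtain the components.

```python
import numpy as np

def pca_scores(X):
    """X: rows are observations, columns are variables (an assumed layout)."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                           # data expressed in the new variables
    variances = S**2 / (X.shape[0] - 1)      # variance of each component, descending
    return scores, variances

# Five interrelated variables generated from two underlying factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

scores, variances = pca_scores(X)
score_cov = np.cov(scores, rowvar=False)            # off-diagonal terms are ~0
explained = np.cumsum(variances) / variances.sum()  # cumulative fraction retained
```

In this synthetic case the first two entries of `explained` account for nearly all of the variation, because only two factors generated the five variables.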

PCA is primarily a mathematical method for data reduction and it does not assume that the data have any particular distribution. We have seen how PCA can be used to reduce the dimensionality of a data set and how it may thus reveal clusters. It has been used, for example, on the results of Fourier transform spectroscopy in order to reveal differences between hair from different racial groups and for classifying different types of cotton fibre. In another example the concentrations of a number of chlorobiphenyls were measured in specimens from a variety of marine mammals. A PCA of the results revealed differences between species, differences between males and females, and differences between young and adult individuals. PCA also finds application in multiple regression (see Section 8.10). [Pg.219]

Evaluation of the statistical properties is a fundamental part of any statistical analysis, and here we concentrated on the distribution of each variable. To reduce the dimensionality of this data set we used principal component analysis (PCA) to explore the covariance structure of these data and to reduce the variables to a more manageable number (PA1 method with no rotation, 21). [Pg.150]

Principal component analysis is a simple vector space transform, allowing the dimensionality of a data set to be reduced, while at the same time minimizing... [Pg.130]

Recall that, in order to generate an ILS calibration, we must have at least as many samples as there are wavelengths used in the calibration. Since we only have 15 spectra in our training sets but each spectrum contains 100 wavelengths, we were forced to find a way to reduce the dimensionality of our spectra to 15 or less. We have seen that principal component analysis (PCA) provides us with a way of optimally reducing the dimensionality of our data without degrading it, and with the added benefit of removing some noise. [Pg.99]
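A rough illustration of this reduction, assuming NumPy and synthetic spectra (15 samples by 100 wavelengths, built from three made-up Gaussian bands). The choice of three factors, the mean-centring step and all variable names are assumptions for the sketch, not the procedure of the source text.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_wavelengths, n_factors = 15, 100, 3

# Synthetic "pure" bands and random concentrations (illustrative only).
wavelengths = np.linspace(0.0, 1.0, n_wavelengths)
centers = np.array([0.25, 0.5, 0.75])
pure = np.exp(-((wavelengths[None, :] - centers[:, None]) / 0.08) ** 2)
conc = rng.uniform(0.1, 1.0, size=(n_samples, 3))
spectra = conc @ pure + 0.001 * rng.normal(size=(n_samples, n_wavelengths))

# PCA via SVD of the mean-centred spectra.
mean_spectrum = spectra.mean(axis=0)
U, S, Vt = np.linalg.svd(spectra - mean_spectrum, full_matrices=False)

# Keep the first few loadings and project: 100 values per spectrum become 3 scores.
loadings = Vt[:n_factors]
scores = (spectra - mean_spectrum) @ loadings.T
```

`scores` has shape (15, 3), so each spectrum is now described by fewer values than there are samples. With 15 mean-centred spectra, at most 14 singular values in `S` can be non-zero, which is why the reduced dimensionality cannot exceed the number of samples.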

Principal Component Analysis (PCA) is the most popular technique of multivariate analysis used in environmental chemistry and toxicology [313-316]. Both PCA and factor analysis (FA) aim to reduce the dimensionality of a set of data, but the two techniques approach this differently. Each provides a different insight into the data structure, with PCA concentrating on explaining the diagonal elements of the covariance matrix, while FA concentrates on the off-diagonal elements [313, 316-319]. Theoretically, PCA corresponds to a mathematical decomposition of the descriptor matrix, X, into means (x_k), scores (t_ia), loadings (p_ak), and residuals (e_ik), which can be expressed as... [Pg.268]
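The expression itself is truncated in this excerpt. As a hedged sketch only, the code below splits a synthetic matrix into the four named parts (means, scores, loadings, residuals) and checks that they add back up to X. The index convention (i for objects, k for variables, a for components), the helper name `decompose` and the use of NumPy's SVD are assumptions.

```python
import numpy as np

def decompose(X, n_components):
    """Split X (objects i x variables k) into means, scores, loadings, residuals."""
    means = X.mean(axis=0)                      # one mean per variable k
    Xc = X - means
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]  # scores, indexed (i, a)
    P = Vt[:n_components]                       # loadings, indexed (a, k)
    E = Xc - T @ P                              # residuals, indexed (i, k)
    return means, T, P, E

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))
means, T, P, E = decompose(X, n_components=2)

# The parts recombine to the original matrix: X = means + T P + E.
X_rebuilt = means + T @ P + E
```

Increasing `n_components` moves variation out of the residuals `E` and into the product of scores and loadings.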

Among the mathematical tools for investigating patterns and clustering behaviour in data sets, two techniques are widely established: principal component analysis and cluster analysis. Both can be used to reduce the dimensionality of a problem. In other words, cluster analysis can be used for variable or descriptor selection from a larger set. Cluster analysis may also be used to investigate similarity among compounds, and it is often used as a complement to PCA. [Pg.365]

Pranckeviciene et al. [11] have assessed the NMR spectra of pathogenic fungi and of human biofluids, finding the spectral signature that comprises a set of attributes that serve to uniquely identify and characterize the sample. This use of GAs effectively reduces the dimensionality of the data, and it can speed up later processing as well as make it more reliable. [Pg.363]

Factor: The result of a transformation of a data matrix where the goal is to reduce the dimensionality of the data set. Estimating factors is necessary to construct principal component regression and partial least-squares models, as discussed in Section 5.3.2. (See also Principal Component.)... [Pg.186]

Principal components analysis is used to obtain a lower-dimensional graphical representation which describes a majority of the variation in a data set. With PCA, a new set of axes are defined in which to plot the samples. They are constructed so that a maximum amount of variation is described with a minimum number of axes. Because it reduces the dimensions required to visualize the data, PCA is a powerful method for studying multidimensional data sets. [Pg.239]

The PCA process reduces the dimensionality of the data set. For example, consider the spectra of five mixtures of solutions of the chemical warfare agent ethyl-N,N-dimethylphosphoroamidocyanidate (GA) in water shown on the left side in Fig. 5-8. The spectra of the pure components are shown... [Pg.279]

PCA is simply a method for reducing the dimensionality of the data set and for removing dependent data (8). Although each of the five mixture spectra in Fig. 5-8 contains almost 4,000 data points, each can be expressed as a sum of two spectra (loadings or pure) containing 4,000 data points each. Thus the dimensionality is reduced from 5 x 4,000, or 20,000 data points, to 2 x 4,000 + 10, or 8,010 values (the value of 10 is for the two coefficients... [Pg.280]
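The counting argument can be reproduced on synthetic data. The sketch below assumes NumPy, two made-up Gaussian "pure" spectra of exactly 4,000 points and five random mixing ratios. None of these values come from Fig. 5-8, and the 10 extra values are read here as two coefficients for each of the five mixtures.

```python
import numpy as np

n_mixtures, n_points, n_components = 5, 4000, 2

# Two hypothetical pure-component spectra (Gaussian bands) and random mixing ratios.
x = np.linspace(0.0, 1.0, n_points)
pure = np.vstack([np.exp(-((x - 0.3) / 0.05) ** 2),
                  np.exp(-((x - 0.7) / 0.10) ** 2)])
rng = np.random.default_rng(3)
coefficients = rng.uniform(0.1, 1.0, size=(n_mixtures, n_components))
mixtures = coefficients @ pure                  # 5 x 4,000 data matrix

original_count = n_mixtures * n_points                                     # 5 x 4,000
reduced_count = n_components * n_points + n_mixtures * n_components        # 2 x 4,000 + 10

# Two loadings plus two scores per mixture reproduce the noise-free data exactly.
U, S, Vt = np.linalg.svd(mixtures, full_matrices=False)
loadings = Vt[:n_components]                         # 2 x 4,000
scores = U[:, :n_components] * S[:n_components]      # 5 x 2 = 10 values
reconstructed = scores @ loadings

print(original_count, reduced_count)
```

With real, noisy spectra the two-factor reconstruction would be approximate, and the discarded factors would carry mostly noise.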

An excessive number of input units not only results in poor generalization, but also slows network training. Reducing the number of input variables often leads to improved performance on a given data set, even though information is being discarded. Thus, one important role of pre-processing is to reduce the dimensionality of the input data. [Pg.84]

Ordination. Ordination procedures place a sample data point in a variable space to represent some trend or variation. No assumptions need to be made regarding the number of groups. A simple type of ordination would be to plot the coordinates of a sample relative to two variables as in Figure 2. For p-variables, higher dimensionality prohibits easy inspection, so most ordination techniques attempt to summarize the information within a data set and reduce the dimensionality (e.g., Figure 1). [Pg.67]

Big Chemical Encyclopedia
